--- categories: - "" - "" date: "2020-09-14T22:42:51-05:00" description: Regression and Prediction Model of AirBnB Prices in Copenhagen draft: false image: copenhagen.jpg keywords: "" slug: blog4 title: How much does it cost to sleep in Copenhagen? subtitle: Various analyses ---
The purpose of this report is to explore data on AirBnB listings in Copenhagen and understand the key drivers of price variability, specifically focused on the price for a 4-night stay for 2 guests.
What is the best model to predict overall price of AirBnB accommodation in Copenhagen for 2 people wishing to spend 4 nights in the city?
The research is based on conducting Exploratory Data Analysis, followed by Geospatial Analysis to gain more context on the variables of interest and their spatial, as well as statistical, distributions, before building, evaluating and ultimately selecting an optimal regression model to predict the price of an AirBnB stay in the city.
This chapter presents the very first steps we took to tidy the data and understand its structure.
The data needed wrangling, since quite a few variables were not of the appropriate data type. Furthermore, a significant number of variables were either incomplete or not readily interpretable. We: adjusted the levels of some variables, clustering and recategorising them; created new cleaned variables; ensured NA values were correctly coded as such; and imputed NA’s using our judgement, wherever necessary.
After having selected meaningful variables, summary statistics were created and then visualized in tables, density plots, bar plots and correlation matrices. In doing so, we found some correlations between review-related variables and listing type-related variables.
This chapter describes geospatial analyses conducted by using the leaflet package. These mapped visualisations were created in order to glean a more holistic perspective not only of the statistical distribution of listings across certain variables and levels, but the spatial distribution of listings with certain characteristics (such as price and type). Our maps include: Nightly Prices in Copenhagen, a Clustered map of AirBnB accommodation, and a comprehensive Map of nightly price, grouped by property type. We summarise some of our key geospatial findings below:
Nightly Prices in Copenhagen Map: listings are highly priced in central areas, especially on the inland waterways and Eastern coastline, such as the lakes and Amager
Clustered map of AirBnB accommodation: the vast majority of listings, as we would expect, are in the city center and in key areas around the center such as Norrebro, Osterbro, Vesterbro and Fredriksberg
Map by property type: the most common property type is Apartment; Apartment and Condominium prices decline with distance to the city center while House and Townhouse prices are not overtly correlated with centrality
This chapter focuses on describing the process behind building our optimal model, guiding the reader through 7 preliminary models before identifying, specifying, testing and comparing linear regression model performance. We conclude by identifying a model which we find to explain the greatest propotion of variance in price of AirBnB listings while upholding core linear regression assumptions and showing the least overfitting of all tested models.
Our Best Model
In summary, our “best”, or optimal, model was generated via the following core steps:
As first step we load the required packages for all stages of our analysis.
Next, we load in the Copenhagen AirBnB data, which has been scraped from the AirBnB website using vroom, which is a package allowing for fast reading of large datasets such as ours. We also clean the variable names as we import the data, to allow us to push on with our EDA without extensive renaming.
We conduct a thorough Exploratory Data Analysis (EDA), using questions to guide our investigation. In this phase of our analyses, we not only attempt to gain an understanding of our data, but also manipulate or ‘wrangle’ it so that each variable behaves as it should, and there is no unnecessary data missingness or structural problems. This allows us to dive deep into our data and uncover hidden relationships without fear of unexpected errors and without being constrained by poor data coding, since our dataset was scraped from AirBnB’s public website.
Our Goals During EDA
Our primary focus during EDA is to develop an understanding of the data. We are guided by the following questions for our investigations:
We use glimpse to have a first look at the dataset, in order to evaluate all columns and the data types.
#we glimpse the initial dataset
glimpse(listings)
## Rows: 28,523
## Columns: 106
## $ id <dbl> 6983, 26057, 26473, 29...
## $ listing_url <chr> "https://www.airbnb.co...
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2....
## $ last_scraped <date> 2020-06-28, 2020-06-2...
## $ name <chr> "Copenhagen 'N Livin'"...
## $ summary <chr> "Lovely apartment loca...
## $ space <chr> "Beautiful and cosy ap...
## $ description <chr> "Lovely apartment loca...
## $ experiences_offered <chr> "none", "none", "none"...
## $ neighborhood_overview <chr> "Nice bars and cozy ca...
## $ notes <chr> NA, NA, NA, NA, "Pleas...
## $ transit <chr> "Bus 66 runs to the ce...
## $ access <chr> "Bedroom, living room,...
## $ interaction <chr> "We are usually at wor...
## $ house_rules <chr> "No smoking allowed! N...
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url <chr> "https://a0.muscache.c...
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id <dbl> 16774, 109777, 112210,...
## $ host_url <chr> "https://www.airbnb.co...
## $ host_name <chr> "Simon", "Kari", "Oliv...
## $ host_since <date> 2009-05-12, 2010-04-1...
## $ host_location <chr> "Copenhagen, Capital R...
## $ host_about <chr> "I'm currently working...
## $ host_response_time <chr> "N/A", "N/A", "within ...
## $ host_response_rate <chr> "N/A", "N/A", "100%", ...
## $ host_acceptance_rate <chr> "33%", "19%", "100%", ...
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url <chr> "https://a0.muscache.c...
## $ host_picture_url <chr> "https://a0.muscache.c...
## $ host_neighbourhood <chr> "Nørrebro", "Indre By"...
## $ host_listings_count <dbl> 1, 1, 4, 1, 1, 1, 3, 1...
## $ host_total_listings_count <dbl> 1, 1, 4, 1, 1, 1, 3, 1...
## $ host_verifications <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified <lgl> FALSE, FALSE, TRUE, FA...
## $ street <chr> "Copenhagen, Hovedstad...
## $ neighbourhood <chr> "Nørrebro", "Indre By"...
## $ neighbourhood_cleansed <chr> "Nrrebro", "Indre By",...
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA...
## $ city <chr> "Copenhagen", "Copenha...
## $ state <chr> "Hovedstaden", "Hoveds...
## $ zipcode <dbl> 2200, 2100, 1210, 1650...
## $ market <chr> "Copenhagen", "Copenha...
## $ smart_location <chr> "Copenhagen, Denmark",...
## $ country_code <chr> "DK", "DK", "DK", "DK"...
## $ country <chr> "Denmark", "Denmark", ...
## $ latitude <dbl> 55.7, 55.7, 55.7, 55.7...
## $ longitude <dbl> 12.5, 12.6, 12.6, 12.6...
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE...
## $ property_type <chr> "Apartment", "House", ...
## $ room_type <chr> "Private room", "Entir...
## $ accommodates <dbl> 2, 6, 12, 2, 4, 3, 3, ...
## $ bathrooms <dbl> 1.0, 1.5, 2.5, 1.0, 1....
## $ bedrooms <dbl> 1, 4, 6, 1, 3, 1, 1, 2...
## $ beds <dbl> 1, 4, 7, 1, 3, 3, 2, 2...
## $ bed_type <chr> "Real Bed", "Real Bed"...
## $ amenities <chr> "{TV,\"Cable TV\",Wifi...
## $ square_feet <dbl> 97, NA, NA, NA, NA, 68...
## $ price <chr> "$365.00", "$2,398.00"...
## $ weekly_price <chr> NA, NA, "$17,513.00", ...
## $ monthly_price <chr> NA, NA, "$67,073.00", ...
## $ security_deposit <chr> "$0.00", "$5,000.00", ...
## $ cleaning_fee <chr> "$33.00", "$1,100.00",...
## $ guests_included <dbl> 1, 3, 1, 1, 1, 2, 2, 2...
## $ extra_people <chr> "$66.00", "$350.00", "...
## $ minimum_nights <dbl> 2, 3, 3, 7, 7, 2, 3, 6...
## $ maximum_nights <dbl> 15, 30, 31, 14, 31, 10...
## $ minimum_minimum_nights <dbl> 2, 3, 3, 3, 7, 2, 3, 6...
## $ maximum_minimum_nights <dbl> 2, 3, 3, 5, 7, 2, 3, 6...
## $ minimum_maximum_nights <dbl> 15, 30, 1125, 14, 1125...
## $ maximum_maximum_nights <dbl> 15, 30, 1125, 14, 1125...
## $ minimum_nights_avg_ntm <dbl> 2.0, 3.0, 3.0, 4.1, 7....
## $ maximum_nights_avg_ntm <dbl> 15, 30, 1125, 14, 1125...
## $ calendar_updated <chr> "5 months ago", "4 mon...
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30 <dbl> 29, 28, 29, 21, 0, 0, ...
## $ availability_60 <dbl> 59, 58, 59, 21, 0, 0, ...
## $ availability_90 <dbl> 89, 88, 89, 21, 0, 0, ...
## $ availability_365 <dbl> 89, 363, 172, 21, 0, 5...
## $ calendar_last_scraped <date> 2020-06-28, 2020-06-2...
## $ number_of_reviews <dbl> 168, 50, 293, 22, 90, ...
## $ number_of_reviews_ltm <dbl> 1, 4, 31, 2, 0, 0, 1, ...
## $ first_review <date> 2009-09-04, 2013-12-0...
## $ last_review <date> 2019-07-19, 2019-12-1...
## $ review_scores_rating <dbl> 96, 98, 91, 98, 94, 97...
## $ review_scores_accuracy <dbl> 10, 10, 10, 10, 10, 10...
## $ review_scores_cleanliness <dbl> 9, 10, 9, 10, 9, 10, 1...
## $ review_scores_checkin <dbl> 10, 10, 10, 10, 10, 10...
## $ review_scores_communication <dbl> 10, 10, 10, 10, 9, 10,...
## $ review_scores_location <dbl> 9, 10, 10, 10, 10, 10,...
## $ review_scores_value <dbl> 9, 10, 9, 10, 9, 9, 9,...
## $ requires_license <lgl> FALSE, FALSE, FALSE, F...
## $ license <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, F...
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy <chr> "moderate", "moderate"...
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, F...
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1...
## $ calculated_host_listings_count_entire_homes <dbl> 0, 1, 1, 1, 1, 1, 1, 1...
## $ calculated_host_listings_count_private_rooms <dbl> 1, 0, 0, 0, 0, 0, 0, 0...
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month <dbl> 1.28, 0.62, 2.48, 0.18...
Using glimpse() we find that there 28,523 observations (rows) across 106 variables (columns). We can already detect that some data types are assigned wrongly (for example the host_acceptance_rate is assigned as character, but should be numeric, double). We bear this in mind, alongside an awareness of other potential issues with the data to be addressed as we proceed.
Then, we skim the data to get a better picture of the dataset:
#we skim the listings data in order to identify variable types and distributions
skim(listings)
| Name | listings |
| Number of rows | 28523 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 45 |
| Date | 5 |
| logical | 16 |
| numeric | 40 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 28523 | 0 |
| name | 57 | 1.00 | 1 | 211 | 0 | 26907 | 0 |
| summary | 1096 | 0.96 | 1 | 1000 | 0 | 26986 | 0 |
| space | 11390 | 0.60 | 1 | 1000 | 0 | 16806 | 0 |
| description | 515 | 0.98 | 1 | 1000 | 0 | 27748 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 12407 | 0.57 | 1 | 1000 | 0 | 15520 | 0 |
| notes | 20832 | 0.27 | 1 | 1000 | 0 | 7366 | 0 |
| transit | 11316 | 0.60 | 1 | 1000 | 0 | 16714 | 0 |
| access | 15751 | 0.45 | 1 | 1000 | 0 | 11251 | 0 |
| interaction | 14111 | 0.51 | 1 | 1000 | 0 | 13728 | 0 |
| house_rules | 13332 | 0.53 | 1 | 1000 | 0 | 13703 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 28282 | 0 |
| host_url | 0 | 1.00 | 37 | 43 | 0 | 25745 | 0 |
| host_name | 12 | 1.00 | 1 | 34 | 0 | 6415 | 0 |
| host_location | 93 | 1.00 | 2 | 152 | 0 | 863 | 0 |
| host_about | 14028 | 0.51 | 1 | 3550 | 0 | 12478 | 34 |
| host_response_time | 11 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 11 | 1.00 | 2 | 4 | 0 | 44 | 0 |
| host_acceptance_rate | 11 | 1.00 | 2 | 4 | 0 | 100 | 0 |
| host_thumbnail_url | 11 | 1.00 | 55 | 106 | 0 | 25669 | 0 |
| host_picture_url | 11 | 1.00 | 57 | 109 | 0 | 25669 | 0 |
| host_neighbourhood | 8007 | 0.72 | 1 | 21 | 0 | 56 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 316 | 0 |
| street | 0 | 1.00 | 10 | 61 | 0 | 528 | 0 |
| neighbourhood | 0 | 1.00 | 5 | 14 | 0 | 21 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 25 | 0 | 11 | 0 |
| city | 13 | 1.00 | 1 | 26 | 0 | 136 | 0 |
| state | 24042 | 0.16 | 1 | 25 | 0 | 170 | 0 |
| market | 864 | 0.97 | 6 | 21 | 0 | 10 | 0 |
| smart_location | 0 | 1.00 | 10 | 35 | 0 | 158 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 29 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1179 | 0 | 26634 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 611 | 0 |
| weekly_price | 25008 | 0.12 | 7 | 11 | 0 | 718 | 0 |
| monthly_price | 26971 | 0.05 | 7 | 11 | 0 | 476 | 0 |
| security_deposit | 13845 | 0.51 | 5 | 10 | 0 | 386 | 0 |
| cleaning_fee | 8968 | 0.69 | 5 | 9 | 0 | 430 | 0 |
| extra_people | 0 | 1.00 | 5 | 9 | 0 | 265 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 82 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| host_since | 11 | 1.00 | 2008-06-27 | 2020-06-26 | 2015-06-22 | 3379 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| first_review | 4968 | 0.83 | 2009-09-04 | 2020-06-28 | 2017-08-07 | 2582 |
| last_review | 4968 | 0.83 | 2011-08-10 | 2020-06-28 | 2019-07-29 | 1846 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 28523 | 0 | NaN | : |
| medium_url | 28523 | 0 | NaN | : |
| xl_picture_url | 28523 | 0 | NaN | : |
| host_is_superhost | 11 | 1 | 0.10 | FAL: 25610, TRU: 2902 |
| host_has_profile_pic | 11 | 1 | 1.00 | TRU: 28445, FAL: 67 |
| host_identity_verified | 11 | 1 | 0.38 | FAL: 17790, TRU: 10722 |
| neighbourhood_group_cleansed | 28523 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 22507, FAL: 6016 |
| has_availability | 0 | 1 | 1.00 | TRU: 28523 |
| requires_license | 0 | 1 | 0.00 | FAL: 28523 |
| license | 28523 | 0 | NaN | : |
| jurisdiction_names | 28523 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.29 | FAL: 20360, TRU: 8163 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 28523 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 28379, TRU: 144 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 28327, TRU: 196 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.04e+07 | 1.23e+07 | 6.98e+03 | 1.01e+07 | 1.95e+07 | 3.04e+07 | 4.40e+07 | <U+2587><U+2587><U+2587><U+2585><U+2586> |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 6.73e+07 | 7.62e+07 | 5.13e+02 | 1.28e+07 | 3.60e+07 | 9.46e+07 | 3.52e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_listings_count | 11 | 1.00 | 5.46e+00 | 3.43e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 7.37e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 11 | 1.00 | 5.46e+00 | 3.43e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 7.37e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 815 | 0.97 | 2.07e+03 | 3.91e+02 | 2.00e+01 | 1.86e+03 | 2.20e+03 | 2.30e+03 | 2.10e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.57e+01 | 2.00e-02 | 5.56e+01 | 5.57e+01 | 5.57e+01 | 5.57e+01 | 5.57e+01 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
| longitude | 0 | 1.00 | 1.26e+01 | 3.00e-02 | 1.24e+01 | 1.25e+01 | 1.26e+01 | 1.26e+01 | 1.26e+01 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| accommodates | 0 | 1.00 | 3.32e+00 | 1.63e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.60e+01 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms | 12 | 1.00 | 1.08e+00 | 2.80e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 29 | 1.00 | 1.55e+00 | 1.06e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.01e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 103 | 1.00 | 2.04e+00 | 1.44e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 2.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| square_feet | 28130 | 0.01 | 7.22e+02 | 5.76e+02 | 0.00e+00 | 1.20e+02 | 7.64e+02 | 1.08e+03 | 2.80e+03 | <U+2587><U+2587><U+2583><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.52e+00 | 1.06e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 3.85e+00 | 1.81e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 6.21e+02 | 5.53e+02 | 1.00e+00 | 1.50e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 0 | 1.00 | 3.84e+00 | 1.81e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 0 | 1.00 | 4.12e+00 | 1.93e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 0 | 1.00 | 6.67e+02 | 5.48e+02 | 1.00e+00 | 2.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 0 | 1.00 | 6.70e+02 | 5.47e+02 | 1.00e+00 | 2.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 0 | 1.00 | 3.97e+00 | 1.84e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 0 | 1.00 | 6.68e+02 | 5.47e+02 | 1.00e+00 | 2.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 5.87e+00 | 1.04e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 8.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_60 | 0 | 1.00 | 1.12e+01 | 2.01e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.40e+01 | 6.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_90 | 0 | 1.00 | 1.65e+01 | 3.00e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.80e+01 | 9.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_365 | 0 | 1.00 | 4.95e+01 | 9.92e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 3.50e+01 | 3.65e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 1.36e+01 | 2.71e+01 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 1.50e+01 | 6.37e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 2.74e+00 | 6.64e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 3.00e+00 | 3.76e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5447 | 0.81 | 9.52e+01 | 6.83e+00 | 2.00e+01 | 9.30e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 5468 | 0.81 | 9.73e+00 | 6.60e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 5466 | 0.81 | 9.38e+00 | 9.60e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 5488 | 0.81 | 9.82e+00 | 5.60e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 5470 | 0.81 | 9.86e+00 | 5.30e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 5491 | 0.81 | 9.60e+00 | 6.80e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 5495 | 0.81 | 9.45e+00 | 7.60e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 4.45e+00 | 2.81e+01 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.81e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.17e+00 | 2.81e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.81e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 2.60e-01 | 7.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.20e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.00e-02 | 3.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.30e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 4968 | 0.83 | 4.90e-01 | 7.30e-01 | 1.00e-02 | 1.20e-01 | 2.80e-01 | 5.90e-01 | 3.06e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
When we skim, we make 4 prominent observations:
Firstly, we find that a number of variables classified as character should in fact be factor (categorical) variables, on the basis that they have few unique levels.
Secondly, not all numeric variables are correctly coded as such.
Thirdly, we identify a number of variables with extensive missingness
Finally, we identify a number of unnecessary variables
Thus, before visualizing the data we want to classify the variables correctly. Hence, we create a list of the originally assigned variables by datatype:
kbl((c(colnames(listings[sapply(listings,is.numeric)]))), col.names=c("Numeric Variables")) %>%
kable_styling()
| Numeric Variables |
|---|
| id |
| scrape_id |
| host_id |
| host_listings_count |
| host_total_listings_count |
| zipcode |
| latitude |
| longitude |
| accommodates |
| bathrooms |
| bedrooms |
| beds |
| square_feet |
| guests_included |
| minimum_nights |
| maximum_nights |
| minimum_minimum_nights |
| maximum_minimum_nights |
| minimum_maximum_nights |
| maximum_maximum_nights |
| minimum_nights_avg_ntm |
| maximum_nights_avg_ntm |
| availability_30 |
| availability_60 |
| availability_90 |
| availability_365 |
| number_of_reviews |
| number_of_reviews_ltm |
| review_scores_rating |
| review_scores_accuracy |
| review_scores_cleanliness |
| review_scores_checkin |
| review_scores_communication |
| review_scores_location |
| review_scores_value |
| calculated_host_listings_count |
| calculated_host_listings_count_entire_homes |
| calculated_host_listings_count_private_rooms |
| calculated_host_listings_count_shared_rooms |
| reviews_per_month |
kbl((c(colnames(listings[sapply(listings,is.character)]))), col.names=c("Character Variables")) %>%
kable_styling()
| Character Variables |
|---|
| listing_url |
| name |
| summary |
| space |
| description |
| experiences_offered |
| neighborhood_overview |
| notes |
| transit |
| access |
| interaction |
| house_rules |
| picture_url |
| host_url |
| host_name |
| host_location |
| host_about |
| host_response_time |
| host_response_rate |
| host_acceptance_rate |
| host_thumbnail_url |
| host_picture_url |
| host_neighbourhood |
| host_verifications |
| street |
| neighbourhood |
| neighbourhood_cleansed |
| city |
| state |
| market |
| smart_location |
| country_code |
| country |
| property_type |
| room_type |
| bed_type |
| amenities |
| price |
| weekly_price |
| monthly_price |
| security_deposit |
| cleaning_fee |
| extra_people |
| calendar_updated |
| cancellation_policy |
kbl((c(colnames(listings[sapply(listings,is.logical)]))), col.names=c("Logical Variables")) %>%
kable_styling()
| Logical Variables |
|---|
| thumbnail_url |
| medium_url |
| xl_picture_url |
| host_is_superhost |
| host_has_profile_pic |
| host_identity_verified |
| neighbourhood_group_cleansed |
| is_location_exact |
| has_availability |
| requires_license |
| license |
| jurisdiction_names |
| instant_bookable |
| is_business_travel_ready |
| require_guest_profile_picture |
| require_guest_phone_verification |
kbl((c(colnames(listings[sapply(listings,is.Date)]))), col.names=c("Date Variables")) %>%
kable_styling()
| Date Variables |
|---|
| last_scraped |
| host_since |
| calendar_last_scraped |
| first_review |
| last_review |
We found out that a lot of variables don’t have the correct data type assigned. Therefore, we change the data types using as.factor and parse_number. Subsequently, we verify the conversion with typeof().
We now inspect the new dataset:
#we check that our recording has been successful
skim(listings_cleaned)
| Name | listings_cleaned |
| Number of rows | 28523 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 29 |
| Date | 5 |
| factor | 8 |
| logical | 16 |
| numeric | 48 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 28523 | 0 |
| name | 57 | 1.00 | 1 | 211 | 0 | 26907 | 0 |
| summary | 1096 | 0.96 | 1 | 1000 | 0 | 26986 | 0 |
| space | 11390 | 0.60 | 1 | 1000 | 0 | 16806 | 0 |
| description | 515 | 0.98 | 1 | 1000 | 0 | 27748 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 12407 | 0.57 | 1 | 1000 | 0 | 15520 | 0 |
| notes | 20832 | 0.27 | 1 | 1000 | 0 | 7366 | 0 |
| transit | 11316 | 0.60 | 1 | 1000 | 0 | 16714 | 0 |
| access | 15751 | 0.45 | 1 | 1000 | 0 | 11251 | 0 |
| interaction | 14111 | 0.51 | 1 | 1000 | 0 | 13728 | 0 |
| house_rules | 13332 | 0.53 | 1 | 1000 | 0 | 13703 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 28282 | 0 |
| host_url | 0 | 1.00 | 37 | 43 | 0 | 25745 | 0 |
| host_name | 12 | 1.00 | 1 | 34 | 0 | 6415 | 0 |
| host_location | 93 | 1.00 | 2 | 152 | 0 | 863 | 0 |
| host_about | 14028 | 0.51 | 1 | 3550 | 0 | 12478 | 34 |
| host_thumbnail_url | 11 | 1.00 | 55 | 106 | 0 | 25669 | 0 |
| host_picture_url | 11 | 1.00 | 57 | 109 | 0 | 25669 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 316 | 0 |
| street | 0 | 1.00 | 10 | 61 | 0 | 528 | 0 |
| city | 13 | 1.00 | 1 | 26 | 0 | 136 | 0 |
| state | 24042 | 0.16 | 1 | 25 | 0 | 170 | 0 |
| market | 864 | 0.97 | 6 | 21 | 0 | 10 | 0 |
| smart_location | 0 | 1.00 | 10 | 35 | 0 | 158 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| amenities | 0 | 1.00 | 2 | 1179 | 0 | 26634 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 82 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| host_since | 11 | 1.00 | 2008-06-27 | 2020-06-26 | 2015-06-22 | 3379 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-26 | 2020-06-30 | 2020-06-27 | 5 |
| first_review | 4968 | 0.83 | 2009-09-04 | 2020-06-28 | 2017-08-07 | 2582 |
| last_review | 4968 | 0.83 | 2011-08-10 | 2020-06-28 | 2019-07-29 | 1846 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| host_response_time | 11 | 1.00 | FALSE | 5 | N/A: 22887, wit: 2371, wit: 1690, wit: 1151 |
| host_neighbourhood | 8007 | 0.72 | FALSE | 56 | Nør: 3876, Ves: 2919, Fre: 2399, Ind: 2375 |
| neighbourhood | 0 | 1.00 | FALSE | 21 | Nør: 5176, Ves: 4279, Ind: 3484, Fre: 3343 |
| neighbourhood_cleansed | 0 | 1.00 | FALSE | 11 | Nrr: 5166, Ves: 4447, Ind: 3784, Fre: 3354 |
| property_type | 0 | 1.00 | FALSE | 29 | Apa: 23942, Con: 1673, Hou: 1365, Tow: 554 |
| room_type | 0 | 1.00 | FALSE | 4 | Ent: 24054, Pri: 4353, Sha: 78, Hot: 38 |
| bed_type | 0 | 1.00 | FALSE | 5 | Rea: 28139, Pul: 207, Fut: 94, Cou: 46 |
| cancellation_policy | 0 | 1.00 | FALSE | 4 | fle: 12435, mod: 8919, str: 6888, sup: 281 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 28523 | 0 | NaN | : |
| medium_url | 28523 | 0 | NaN | : |
| xl_picture_url | 28523 | 0 | NaN | : |
| host_is_superhost | 11 | 1 | 0.10 | FAL: 25610, TRU: 2902 |
| host_has_profile_pic | 11 | 1 | 1.00 | TRU: 28445, FAL: 67 |
| host_identity_verified | 11 | 1 | 0.38 | FAL: 17790, TRU: 10722 |
| neighbourhood_group_cleansed | 28523 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 22507, FAL: 6016 |
| has_availability | 0 | 1 | 1.00 | TRU: 28523 |
| requires_license | 0 | 1 | 0.00 | FAL: 28523 |
| license | 28523 | 0 | NaN | : |
| jurisdiction_names | 28523 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.29 | FAL: 20360, TRU: 8163 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 28523 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 28379, TRU: 144 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 28327, TRU: 196 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.04e+07 | 1.23e+07 | 6.98e+03 | 1.01e+07 | 1.95e+07 | 3.04e+07 | 4.40e+07 | <U+2587><U+2587><U+2587><U+2585><U+2586> |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 6.73e+07 | 7.62e+07 | 5.13e+02 | 1.28e+07 | 3.60e+07 | 9.46e+07 | 3.52e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_response_rate | 22898 | 0.20 | 8.98e+01 | 2.65e+01 | 0.00e+00 | 1.00e+02 | 1.00e+02 | 1.00e+02 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| host_acceptance_rate | 12447 | 0.56 | 7.36e+01 | 3.24e+01 | 0.00e+00 | 5.00e+01 | 8.90e+01 | 1.00e+02 | 1.00e+02 | <U+2582><U+2581><U+2582><U+2582><U+2587> |
| host_listings_count | 11 | 1.00 | 5.46e+00 | 3.43e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 7.37e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 11 | 1.00 | 5.46e+00 | 3.43e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 7.37e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 815 | 0.97 | 2.07e+03 | 3.91e+02 | 2.00e+01 | 1.86e+03 | 2.20e+03 | 2.30e+03 | 2.10e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.57e+01 | 2.00e-02 | 5.56e+01 | 5.57e+01 | 5.57e+01 | 5.57e+01 | 5.57e+01 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
| longitude | 0 | 1.00 | 1.26e+01 | 3.00e-02 | 1.24e+01 | 1.25e+01 | 1.26e+01 | 1.26e+01 | 1.26e+01 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| accommodates | 0 | 1.00 | 3.32e+00 | 1.63e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.60e+01 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms | 12 | 1.00 | 1.08e+00 | 2.80e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 29 | 1.00 | 1.55e+00 | 1.06e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.01e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 103 | 1.00 | 2.04e+00 | 1.44e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 2.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| square_feet | 28130 | 0.01 | 7.22e+02 | 5.76e+02 | 0.00e+00 | 1.20e+02 | 7.64e+02 | 1.08e+03 | 2.80e+03 | <U+2587><U+2587><U+2583><U+2581><U+2581> |
| price | 0 | 1.00 | 8.49e+02 | 1.07e+03 | 0.00e+00 | 4.98e+02 | 6.98e+02 | 9.96e+02 | 6.92e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| weekly_price | 25008 | 0.12 | 4.90e+03 | 4.94e+03 | 1.50e+02 | 3.00e+03 | 4.02e+03 | 5.50e+03 | 1.49e+05 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| monthly_price | 26971 | 0.05 | 1.67e+04 | 1.88e+04 | 4.75e+02 | 1.00e+04 | 1.39e+04 | 2.00e+04 | 3.73e+05 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| security_deposit | 13845 | 0.51 | 1.45e+03 | 3.29e+03 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.50e+03 | 3.52e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 8968 | 0.69 | 3.07e+02 | 2.48e+02 | 0.00e+00 | 1.50e+02 | 2.50e+02 | 4.00e+02 | 4.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.52e+00 | 1.06e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 6.39e+01 | 1.26e+02 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+02 | 2.02e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 3.85e+00 | 1.81e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 6.21e+02 | 5.53e+02 | 1.00e+00 | 1.50e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 0 | 1.00 | 3.84e+00 | 1.81e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 0 | 1.00 | 4.12e+00 | 1.93e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 0 | 1.00 | 6.67e+02 | 5.48e+02 | 1.00e+00 | 2.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 0 | 1.00 | 6.70e+02 | 5.47e+02 | 1.00e+00 | 2.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 0 | 1.00 | 3.97e+00 | 1.84e+01 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 4.00e+00 | 1.10e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 0 | 1.00 | 6.68e+02 | 5.47e+02 | 1.00e+00 | 2.00e+01 | 1.12e+03 | 1.12e+03 | 1.00e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 5.87e+00 | 1.04e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 8.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_60 | 0 | 1.00 | 1.12e+01 | 2.01e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.40e+01 | 6.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| availability_90 | 0 | 1.00 | 1.65e+01 | 3.00e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.80e+01 | 9.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_365 | 0 | 1.00 | 4.95e+01 | 9.92e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 3.50e+01 | 3.65e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 1.36e+01 | 2.71e+01 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 1.50e+01 | 6.37e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 2.74e+00 | 6.64e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 3.00e+00 | 3.76e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5447 | 0.81 | 9.52e+01 | 6.83e+00 | 2.00e+01 | 9.30e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 5468 | 0.81 | 9.73e+00 | 6.60e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 5466 | 0.81 | 9.38e+00 | 9.60e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 5488 | 0.81 | 9.82e+00 | 5.60e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 5470 | 0.81 | 9.86e+00 | 5.30e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 5491 | 0.81 | 9.60e+00 | 6.80e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 5495 | 0.81 | 9.45e+00 | 7.60e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 4.45e+00 | 2.81e+01 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.81e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.17e+00 | 2.81e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.81e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 2.60e-01 | 7.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.20e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.00e-02 | 3.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.30e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 4968 | 0.83 | 4.90e-01 | 7.30e-01 | 1.00e-02 | 1.20e-01 | 2.80e-01 | 5.90e-01 | 3.06e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
We compare the newly created dataset, listings_cleaned, to the original dataset, listings, and detect that the count of N/A’s seem to vary for host_response_rate and host_response_time.
To understand the reason, we inspect one variable in depth: host_response_rate. In order to do this, we print the unique values of host_response_rate for both datasets and then skim the variable characteristics in each dataset.
#we look at the variable host_response_rate in depth
unique(listings$host_response_rate) #levels of host_response_rate in listings
## [1] "N/A" "100%" "67%" "80%" "50%" "83%" "0%" "82%" "64%" "71%"
## [11] "33%" "75%" "90%" NA "60%" "88%" "78%" "86%" "89%" "79%"
## [21] "77%" "25%" "40%" "91%" "20%" "14%" "17%" "29%" "70%" "94%"
## [31] "43%" "63%" "87%" "93%" "97%" "56%" "38%" "99%" "57%" "44%"
## [41] "92%" "30%" "36%" "10%" "59%"
unique(listings_cleaned$host_response_rate) #levels of host_response_rate in listings_cleaned
## [1] NA 100 67 80 50 83 0 82 64 71 33 75 90 60 88 78 86 89 79
## [20] 77 25 40 91 20 14 17 29 70 94 43 63 87 93 97 56 38 99 57
## [39] 44 92 30 36 10 59
skim(listings$host_response_rate)
| Name | listings$host_response_ra… |
| Number of rows | 28523 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| data | 11 | 1 | 2 | 4 | 0 | 44 | 0 |
skim(listings_cleaned$host_response_rate)
| Name | listings_cleaned$host_res… |
| Number of rows | 28523 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 22898 | 0.2 | 89.8 | 26.5 | 0 | 100 | 100 | 100 | 100 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
We find that in the dataset ‘listings’ we have two different N/A values: “N/A” and NA, while in the ‘listings_cleaned’ dataset, both are grouped under NA. Moreover, we detect that in the count of n_missing there is a difference of 22887. Let’s count the number of “N/A” observations in ‘listings’, apart from the 11 NA.
listings%>%
filter(host_response_rate=="N/A") %>%
summarise(count=n())
| count |
|---|
| 22887 |
Now we can be confident that our transformation of variables has been accurateand has not created unintended NAs, we move onto selecting only the most prominent variables for our subsequent analyses.
Before we generate summary statistics and visualizations, we select relevant variables as we do not need all 106. To be able to easily refer to the variables, we print a list of all variables with their corresponding column number, which we use as an identifier:
variable_codes <-
c(colnames(
listings_cleaned)
)
print(variable_codes)
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "name"
## [6] "summary"
## [7] "space"
## [8] "description"
## [9] "experiences_offered"
## [10] "neighborhood_overview"
## [11] "notes"
## [12] "transit"
## [13] "access"
## [14] "interaction"
## [15] "house_rules"
## [16] "thumbnail_url"
## [17] "medium_url"
## [18] "picture_url"
## [19] "xl_picture_url"
## [20] "host_id"
## [21] "host_url"
## [22] "host_name"
## [23] "host_since"
## [24] "host_location"
## [25] "host_about"
## [26] "host_response_time"
## [27] "host_response_rate"
## [28] "host_acceptance_rate"
## [29] "host_is_superhost"
## [30] "host_thumbnail_url"
## [31] "host_picture_url"
## [32] "host_neighbourhood"
## [33] "host_listings_count"
## [34] "host_total_listings_count"
## [35] "host_verifications"
## [36] "host_has_profile_pic"
## [37] "host_identity_verified"
## [38] "street"
## [39] "neighbourhood"
## [40] "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed"
## [42] "city"
## [43] "state"
## [44] "zipcode"
## [45] "market"
## [46] "smart_location"
## [47] "country_code"
## [48] "country"
## [49] "latitude"
## [50] "longitude"
## [51] "is_location_exact"
## [52] "property_type"
## [53] "room_type"
## [54] "accommodates"
## [55] "bathrooms"
## [56] "bedrooms"
## [57] "beds"
## [58] "bed_type"
## [59] "amenities"
## [60] "square_feet"
## [61] "price"
## [62] "weekly_price"
## [63] "monthly_price"
## [64] "security_deposit"
## [65] "cleaning_fee"
## [66] "guests_included"
## [67] "extra_people"
## [68] "minimum_nights"
## [69] "maximum_nights"
## [70] "minimum_minimum_nights"
## [71] "maximum_minimum_nights"
## [72] "minimum_maximum_nights"
## [73] "maximum_maximum_nights"
## [74] "minimum_nights_avg_ntm"
## [75] "maximum_nights_avg_ntm"
## [76] "calendar_updated"
## [77] "has_availability"
## [78] "availability_30"
## [79] "availability_60"
## [80] "availability_90"
## [81] "availability_365"
## [82] "calendar_last_scraped"
## [83] "number_of_reviews"
## [84] "number_of_reviews_ltm"
## [85] "first_review"
## [86] "last_review"
## [87] "review_scores_rating"
## [88] "review_scores_accuracy"
## [89] "review_scores_cleanliness"
## [90] "review_scores_checkin"
## [91] "review_scores_communication"
## [92] "review_scores_location"
## [93] "review_scores_value"
## [94] "requires_license"
## [95] "license"
## [96] "jurisdiction_names"
## [97] "instant_bookable"
## [98] "is_business_travel_ready"
## [99] "cancellation_policy"
## [100] "require_guest_profile_picture"
## [101] "require_guest_phone_verification"
## [102] "calculated_host_listings_count"
## [103] "calculated_host_listings_count_entire_homes"
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms"
## [106] "reviews_per_month"
Now, we can select the variables that we believe will truly be relevant to our subsequent analyses:
#we subsequently select relevant variables on the basis of their column numbers in the listings_cleaned dataframe
#the listings_necessary dataframe contains all the variables and data with which we will proceed: below we provide a brief annotation of which variables we chose
listings_necessary <-
listings_cleaned %>%
select(c(2, #listings url
#5:10, #if we than can use keywords to filter relevant information
15, #house rules
23, #host since
27:29, #host info: host_response_rate, host_acceptance_rate, host_is_superhost
33, #host listings count
35:37, #host info: host_verifications, host_has_profile_pic, host_identity_verified
39:41, #neighborhood: neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed
44, #zipcode
49:50,#latitude, longitude
51:61, #flat info: is_location_exact, property_type, room_type, accommodates, bathrooms, bedrooms, beds, amenities, square_feet, price
64:68, # security_deposit, leaning_fee, guests_included, extra_people, minimum_nights
83:93, #number_of_reviews. number_of_reviews_ltm", "first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communicatio, review_scores_location. review_scores_value"
97, #instant_bookable
98, #is_business_travel_ready
99, #cancellation_policy
106) #reviews_per_month
)
Now that we have created listings_necessary, we will be working only with the columns (variables) which we think could be relevant to our subsequent analyses. Note that we deliberately excluded the following:
listings %>%
select(- colnames(listings_necessary)) %>%
colnames
## [1] "id"
## [2] "scrape_id"
## [3] "last_scraped"
## [4] "name"
## [5] "summary"
## [6] "space"
## [7] "description"
## [8] "experiences_offered"
## [9] "neighborhood_overview"
## [10] "notes"
## [11] "transit"
## [12] "access"
## [13] "interaction"
## [14] "thumbnail_url"
## [15] "medium_url"
## [16] "picture_url"
## [17] "xl_picture_url"
## [18] "host_id"
## [19] "host_url"
## [20] "host_name"
## [21] "host_location"
## [22] "host_about"
## [23] "host_response_time"
## [24] "host_thumbnail_url"
## [25] "host_picture_url"
## [26] "host_neighbourhood"
## [27] "host_total_listings_count"
## [28] "street"
## [29] "city"
## [30] "state"
## [31] "market"
## [32] "smart_location"
## [33] "country_code"
## [34] "country"
## [35] "weekly_price"
## [36] "monthly_price"
## [37] "maximum_nights"
## [38] "minimum_minimum_nights"
## [39] "maximum_minimum_nights"
## [40] "minimum_maximum_nights"
## [41] "maximum_maximum_nights"
## [42] "minimum_nights_avg_ntm"
## [43] "maximum_nights_avg_ntm"
## [44] "calendar_updated"
## [45] "has_availability"
## [46] "availability_30"
## [47] "availability_60"
## [48] "availability_90"
## [49] "availability_365"
## [50] "calendar_last_scraped"
## [51] "requires_license"
## [52] "license"
## [53] "jurisdiction_names"
## [54] "require_guest_profile_picture"
## [55] "require_guest_phone_verification"
## [56] "calculated_host_listings_count"
## [57] "calculated_host_listings_count_entire_homes"
## [58] "calculated_host_listings_count_private_rooms"
## [59] "calculated_host_listings_count_shared_rooms"
Therefore, variables such as the host names, host ID and descriptive variables are not in the selected list, since they do not provide any meaningful insights. However, we have not yet reduced the variables down only to those we intend to use - there are many variables here which are likely to be redundant in our final analyses, but which we must explore further to decide whether to use in their present form, transform further into a useful form, or eliminate altogether.
Before we make further evaluations of which variables matter most to our analyses, we address missingness, which is a prominent issue for a small number of our chosen variables.
Identifying Missing Values
We ask the question: “are there any missing values?”, as we review the transformed data with the skim() function.
# Looking at the transformed data to find out variables with missing values
skim(listings_necessary)
| Name | listings_necessary |
| Number of rows | 28523 |
| Number of columns | 47 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 3 |
| factor | 6 |
| logical | 7 |
| numeric | 27 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 28523 | 0 |
| house_rules | 13332 | 0.53 | 1 | 1000 | 0 | 13703 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 316 | 0 |
| amenities | 0 | 1.00 | 2 | 1179 | 0 | 26634 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 11 | 1.00 | 2008-06-27 | 2020-06-26 | 2015-06-22 | 3379 |
| first_review | 4968 | 0.83 | 2009-09-04 | 2020-06-28 | 2017-08-07 | 2582 |
| last_review | 4968 | 0.83 | 2011-08-10 | 2020-06-28 | 2019-07-29 | 1846 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighbourhood | 0 | 1 | FALSE | 21 | Nør: 5176, Ves: 4279, Ind: 3484, Fre: 3343 |
| neighbourhood_cleansed | 0 | 1 | FALSE | 11 | Nrr: 5166, Ves: 4447, Ind: 3784, Fre: 3354 |
| property_type | 0 | 1 | FALSE | 29 | Apa: 23942, Con: 1673, Hou: 1365, Tow: 554 |
| room_type | 0 | 1 | FALSE | 4 | Ent: 24054, Pri: 4353, Sha: 78, Hot: 38 |
| bed_type | 0 | 1 | FALSE | 5 | Rea: 28139, Pul: 207, Fut: 94, Cou: 46 |
| cancellation_policy | 0 | 1 | FALSE | 4 | fle: 12435, mod: 8919, str: 6888, sup: 281 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 11 | 1 | 0.10 | FAL: 25610, TRU: 2902 |
| host_has_profile_pic | 11 | 1 | 1.00 | TRU: 28445, FAL: 67 |
| host_identity_verified | 11 | 1 | 0.38 | FAL: 17790, TRU: 10722 |
| neighbourhood_group_cleansed | 28523 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 22507, FAL: 6016 |
| instant_bookable | 0 | 1 | 0.29 | FAL: 20360, TRU: 8163 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 28523 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_response_rate | 22898 | 0.20 | 89.75 | 26.47 | 0.00 | 100.00 | 100.00 | 100.00 | 100.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| host_acceptance_rate | 12447 | 0.56 | 73.56 | 32.40 | 0.00 | 50.00 | 89.00 | 100.00 | 100.0 | <U+2582><U+2581><U+2582><U+2582><U+2587> |
| host_listings_count | 11 | 1.00 | 5.46 | 34.32 | 0.00 | 1.00 | 1.00 | 1.00 | 737.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 815 | 0.97 | 2069.76 | 390.81 | 20.00 | 1855.75 | 2200.00 | 2300.00 | 21000.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 55.68 | 0.02 | 55.62 | 55.67 | 55.68 | 55.70 | 55.7 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
| longitude | 0 | 1.00 | 12.56 | 0.03 | 12.45 | 12.54 | 12.55 | 12.58 | 12.6 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| accommodates | 0 | 1.00 | 3.32 | 1.63 | 1.00 | 2.00 | 3.00 | 4.00 | 16.0 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms | 12 | 1.00 | 1.08 | 0.28 | 0.00 | 1.00 | 1.00 | 1.00 | 10.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 29 | 1.00 | 1.55 | 1.06 | 0.00 | 1.00 | 1.00 | 2.00 | 101.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 103 | 1.00 | 2.04 | 1.44 | 0.00 | 1.00 | 2.00 | 3.00 | 25.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| square_feet | 28130 | 0.01 | 721.89 | 576.10 | 0.00 | 120.00 | 764.00 | 1076.00 | 2799.0 | <U+2587><U+2587><U+2583><U+2581><U+2581> |
| price | 0 | 1.00 | 849.19 | 1066.81 | 0.00 | 498.00 | 698.00 | 996.00 | 69175.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| security_deposit | 13845 | 0.51 | 1448.66 | 3292.68 | 0.00 | 0.00 | 0.00 | 1500.00 | 35198.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 8968 | 0.69 | 306.88 | 247.65 | 0.00 | 150.00 | 250.00 | 400.00 | 4000.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.52 | 1.06 | 1.00 | 1.00 | 1.00 | 2.00 | 16.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 63.88 | 126.23 | 0.00 | 0.00 | 0.00 | 100.00 | 2024.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 3.85 | 18.07 | 1.00 | 2.00 | 3.00 | 4.00 | 1100.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 13.57 | 27.09 | 0.00 | 1.00 | 5.00 | 15.00 | 637.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 2.74 | 6.64 | 0.00 | 0.00 | 0.00 | 3.00 | 376.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5447 | 0.81 | 95.16 | 6.83 | 20.00 | 93.00 | 97.00 | 100.00 | 100.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 5468 | 0.81 | 9.73 | 0.66 | 2.00 | 10.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 5466 | 0.81 | 9.38 | 0.96 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 5488 | 0.81 | 9.82 | 0.56 | 2.00 | 10.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 5470 | 0.81 | 9.86 | 0.53 | 2.00 | 10.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 5491 | 0.81 | 9.60 | 0.68 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 5495 | 0.81 | 9.45 | 0.76 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| reviews_per_month | 4968 | 0.83 | 0.49 | 0.73 | 0.01 | 0.12 | 0.28 | 0.59 | 30.6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
We identify a few key variables with substantial missingness, including cleaning_fee, security_deposit, bathrooms, bedrooms, beds, and review scores, among others. We address these by asking ourselves (and investigating in the chunk below):
In addition, we employ the chunk below to investigate the questions:
We find that the number of missing values for cleaning_fee is 8968.
The reason for so many missing values in cleaning_fee is likely to be be the absence of cleaning_fee in those listings. Building up on this logic, we carry out the imputation of missing values of cleaning_fee by 0. The same logic is applied for the imputation of security_deposit, since its absence in the data is most likely because there is no security deposit required for the listing.
Addressing Missing Values
#we find that the variables "host_response_rate","weekly_price","monthly_price" and "host_response_time" have more than 80% data missing, hence we do not impute them, since this would most likely introduce errors into our analyses
# missing value of cleaning_fee indicates there wasn't any cleaning fee and hence we impute it by 0
listings_necessary <- listings_necessary %>%
mutate(cleaning_fee = case_when(
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee
))
# missing value of security_deposit indicates there wasn't any security deposit and hence we impute it by 0
listings_necessary <- listings_necessary %>%
mutate(security_deposit = case_when(
is.na(security_deposit) ~ 0,
TRUE ~ security_deposit
))
# missing value of bathrooms indicates there wasn't any bathroom and hence we impute it by 0
listings_necessary <- listings_necessary %>%
mutate(bathrooms = case_when(
is.na(bathrooms) ~ 0,
TRUE ~ bathrooms
))
# missing value of bedrooms indicates there wasn't any bedroom and hence we impute it by 0. We have values like "Shared room", "Hotel room" and "Private room" which may indicate ambiguity on their status as a bedroom
listings_necessary <- listings_necessary %>%
mutate(bedrooms = case_when(
is.na(bedrooms) ~ 0,
TRUE ~ bedrooms
))
# If it's not mentioned then it's possible there is no bed at all. It is also possible that there is a "sofa" or an "airbed" in place which are the values of the variable bed_type
listings_necessary <- listings_necessary %>%
mutate(beds = case_when(
is.na(beds) ~ 0,
TRUE ~ beds
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_rating = case_when(
is.na(review_scores_rating ) ~ median(review_scores_rating,na.rm = TRUE),
TRUE ~ review_scores_rating
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_accuracy = case_when(
is.na(review_scores_accuracy) ~ median(review_scores_accuracy, na.rm = TRUE),
TRUE ~ review_scores_accuracy
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_cleanliness = case_when(
is.na(review_scores_cleanliness) ~ median(review_scores_cleanliness, na.rm = TRUE),
TRUE ~ review_scores_cleanliness
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_checkin = case_when(
is.na(review_scores_checkin) ~ median(review_scores_checkin,na.rm = TRUE),
TRUE ~ review_scores_checkin
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_communication = case_when(
is.na(review_scores_communication) ~ median(review_scores_communication,na.rm = TRUE),
TRUE ~ review_scores_communication
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_location = case_when(
is.na(review_scores_location) ~ median(review_scores_location, na.rm = TRUE),
TRUE ~ review_scores_location
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(review_scores_value = case_when(
is.na(review_scores_value) ~ median(review_scores_value, na.rm = TRUE ),
TRUE ~ review_scores_value
))
# These are probably new listings which do not have a rating yet. We impute these ratings by a central data point : Median
listings_necessary <- listings_necessary %>%
mutate(reviews_per_month = case_when(
is.na(reviews_per_month) ~ median(reviews_per_month, na.rm = TRUE),
TRUE ~ reviews_per_month
))
# The following variables have 11 datapoints which are simultaneously missing. This maybe caused due to some issues in data storage, however we will follow a conservative approach in imputing these, which is undertaken as follows:
# If the host was a Superhost, his data wouldn't be missing and hence we impute it by FALSE
listings_necessary <- listings_necessary %>%
mutate(host_is_superhost = case_when(
is.na(host_is_superhost) ~ FALSE,
TRUE ~ host_is_superhost
))
# We impute host_has_profile_pic by FALSE following our conservative approach and previous logic for Superhosts: if it were true, it would be true - no response means there is most likely no profile picture
listings_necessary <- listings_necessary %>%
mutate(host_has_profile_pic = case_when(
is.na(host_has_profile_pic) ~ FALSE,
TRUE ~ host_has_profile_pic
))
# We impute host_identity_verified by FALSE following our conservative approach and previous logic for Superhosts: if it were true, it would be true - no response means there is most likely no verification
listings_necessary <- listings_necessary %>%
mutate(host_identity_verified = case_when(
is.na(host_identity_verified) ~ FALSE,
TRUE ~ host_identity_verified
))
# These are probably new listings which donot have their listings recorded. We impute it by a central value (Median)
listings_necessary <- listings_necessary %>%
mutate(host_listings_count = case_when(
is.na(host_listings_count) ~ median(host_listings_count, na.rm =TRUE),
TRUE ~ host_listings_count
))
Creating prop_type_simplified
Further, we have certain variables, including property_type that have a known set of values but a skewed frequency distribution. In order to clean these, we will build factor variables by grouping the values of lower frequency into one category and keeping the values with higher frequency as categories. We perform these operations in the chunk below:
#we now make factor variables from variables having a known set of possible values but only a few values with high frequency
# first, we look at frequencies of different values of property_type
count_property_type <- listings_necessary %>%
count(property_type) %>%
arrange(desc(n))
# next, we use this to produce a tabular Output for Frequency Distribution of property_type
kbl(count_property_type, col.names=c("Values of property type","Frequency")) %>%
kable_styling()
| Values of property type | Frequency |
|---|---|
| Apartment | 23942 |
| Condominium | 1673 |
| House | 1365 |
| Townhouse | 554 |
| Serviced apartment | 331 |
| Loft | 231 |
| Villa | 190 |
| Hostel | 28 |
| Guesthouse | 23 |
| Bungalow | 22 |
| Guest suite | 22 |
| Bed and breakfast | 21 |
| Houseboat | 21 |
| Boat | 20 |
| Hotel | 20 |
| Other | 14 |
| Tiny house | 14 |
| Cabin | 10 |
| Boutique hotel | 5 |
| Cottage | 4 |
| Hut | 3 |
| Aparthotel | 2 |
| Barn | 2 |
| Camper/RV | 1 |
| Casa particular (Cuba) | 1 |
| Earth house | 1 |
| Island | 1 |
| Lighthouse | 1 |
| Tent | 1 |
We find that the 4 most common property types are Apartment, Condominium, House and Townhouse (in that order).
We now check the percentage of the total data points that the top 4 most frequent values of property types constitute:
(sum(count_property_type[1:4,"n"])/sum(count_property_type[,"n"]))*100
## [1] 96.5
The 4 most common property types make up 96.5% of the total listings in our data, which is certainly high enough to use a simplified version of property type. We select the top 4 values and bundle the remaining ones into “Other”.
#we now select the top 4 values as categories and bundle everything else in "Other"
listings_necessary <- listings_necessary %>%
mutate(prop_type_simplified = case_when(
as.character(property_type) %in% c("Apartment","Condominium", "House","Townhouse") ~ as.character(property_type),
TRUE ~ "Others"
))
#we now check for the correct creation of prop_type_simplified. We look at the frequency distribution and find out that the new frequency distribution is consistent with the older one
prop_type_creation_check <- listings_necessary %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
#as a result of this output, we can be confident that prop_type_simplified was correctly made
#we also provide a tabular output of the above classification check
kbl(prop_type_creation_check,col.names=c("Property Type","Property Type Simplified","Number of Listings")) %>%
kable_styling()
| Property Type | Property Type Simplified | Number of Listings |
|---|---|---|
| Apartment | Apartment | 23942 |
| Condominium | Condominium | 1673 |
| House | House | 1365 |
| Townhouse | Townhouse | 554 |
| Serviced apartment | Others | 331 |
| Loft | Others | 231 |
| Villa | Others | 190 |
| Hostel | Others | 28 |
| Guesthouse | Others | 23 |
| Bungalow | Others | 22 |
| Guest suite | Others | 22 |
| Bed and breakfast | Others | 21 |
| Houseboat | Others | 21 |
| Boat | Others | 20 |
| Hotel | Others | 20 |
| Other | Others | 14 |
| Tiny house | Others | 14 |
| Cabin | Others | 10 |
| Boutique hotel | Others | 5 |
| Cottage | Others | 4 |
| Hut | Others | 3 |
| Aparthotel | Others | 2 |
| Barn | Others | 2 |
| Camper/RV | Others | 1 |
| Casa particular (Cuba) | Others | 1 |
| Earth house | Others | 1 |
| Island | Others | 1 |
| Lighthouse | Others | 1 |
| Tent | Others | 1 |
Creating Other Variables
(1) Creating accommodates_simplified
We apply the same reasoning from our recoding of property_type into property_type_simplified to create other simplified variables which will be of greater use in our subsequent analyses.
We create the new variable acommodates_simplified. To do so we first look at the distribution of accommodates:
#we evaluate the frequencies of the different levels within the accommodates variable (the Number of people a listing can accommodate)
count_accommodation_limit <- listings_necessary %>%
count(accommodates) %>%
arrange(desc(n))
#next, we use this to produce a tabular Output for Frequency Distribution of accommodates
kbl(count_accommodation_limit, col.names=c("Number of People Listing Accommodates","Number of Such Listings")) %>%
kable_styling()
| Number of People Listing Accommodates | Number of Such Listings |
|---|---|
| 2 | 11649 |
| 4 | 6828 |
| 3 | 3842 |
| 5 | 2137 |
| 6 | 2089 |
| 1 | 967 |
| 8 | 434 |
| 7 | 346 |
| 10 | 94 |
| 9 | 88 |
| 12 | 26 |
| 11 | 10 |
| 16 | 10 |
| 14 | 2 |
| 15 | 1 |
We find that the distribution is evidently right (positively) skewed, so we can group greater observations to one value we we did before. Therefore, we keep the levels 1-6 and group all levels >7 into 7 in order to transform accommodates into accommodates_simplified:
#we keep levels 2,4,3,5,6,1 and group number of accommodates greater than 7 into 7
listings_necessary <- listings_necessary %>%
mutate(accommodates_simplified = case_when(
accommodates %in% c(2,4,3,5,6,1) ~ accommodates,
TRUE ~ 7
))
#we check for the correct creation of accommodates_simplified. We look at the frequency distribution and find out that the new frequency distribution is consistent with the older one
accommodates_simplified_creation_check <- listings_necessary %>%
count(accommodates, accommodates_simplified) %>%
arrange(desc(n))
#as a result of this output, we can be confident that prop_type_simplified was correctly made
#we also provide a tabular output of the above classification check
kbl(accommodates_simplified_creation_check,col.names=c("Accommodates","Accommodates Simplified","Number of Listings")) %>%
kable_styling()
| Accommodates | Accommodates Simplified | Number of Listings |
|---|---|---|
| 2 | 2 | 11649 |
| 4 | 4 | 6828 |
| 3 | 3 | 3842 |
| 5 | 5 | 2137 |
| 6 | 6 | 2089 |
| 1 | 1 | 967 |
| 8 | 7 | 434 |
| 7 | 7 | 346 |
| 10 | 7 | 94 |
| 9 | 7 | 88 |
| 12 | 7 | 26 |
| 11 | 7 | 10 |
| 16 | 7 | 10 |
| 14 | 7 | 2 |
| 15 | 7 | 1 |
(2) Changing the data type of prop_type_simplified and accommodates_simplified
Now, we change the data type of the two newly created variables, making factor variables (as they have a known set of possible values but only a few values with high frequency):
#since both of these variables are of type character, we convert them to factor variables
listings_necessary <- listings_necessary %>%
mutate(
#we recode these columns to factors
accommodates_simplified = as.numeric(accommodates_simplified),
prop_type_simplified = as.factor(prop_type_simplified)
)
(3) Focusing on Leisure, not business
Further, we want to focus only on listings that are not made for business travel but merely for leisure. Therefore, we count the number of business_travel_ready listings:
#we find out the number of listings that cater to travel
count_is_business_travel_ready <- listings_necessary %>%
count(is_business_travel_ready) %>%
arrange(desc(n))
#we summarise this as a tabular output
kbl(count_is_business_travel_ready, col.names=c("Business Travel Ready?","Count")) %>%
kable_styling()
| Business Travel Ready? | Count |
|---|---|
| FALSE | 28523 |
We find that none of the listings is ready for business travel, and thus can keep all listings as leisure listings.
(4) Putting Bedrooms, Bathrooms and Beds in relation to accommodates_simplified
Bedrooms, bathrooms and beds are more insightful if we put them into relation to the accommodates.
We do this by creating new variables:
#bedrooms_per_capacity = bedrooms / accommodates. This represents the number of bedrooms available to one person, given that the maximum accommodation possible in the property is "accommodates"
listings_necessary <- listings_necessary %>%
mutate(
bedrooms_per_capacity = bedrooms/accommodates
)
#beds_per_capacity = beds / accommodates. This represents the number of beds available to one person, given that the maximum accommodation possible in the property is "accommodates"
listings_necessary <- listings_necessary %>%
mutate(
beds_per_capacity = beds/accommodates
)
#bathrooms_per_capacity = bathrooms / accommodates. This represents the number of bathrooms available to one person, given that the maximum accommodation possible in the property is "accommodates"
listings_necessary <- listings_necessary %>%
mutate(
bathrooms_per_capacity = bathrooms/accommodates
)
Evaluating minimum_nights Stay
Now, we want to find the most common value for minimum nights:
#here we find out the most common value for minimum mights
count_minimum_nights <- listings_necessary %>%
count(minimum_nights) %>%
arrange(desc(n))
#we tabulate our output for interpretability
kbl(count_minimum_nights,align = 'l',col.names = c("Minimum Nights","Number of Listings")) %>%
kable_styling()
| Minimum Nights | Number of Listings |
|---|---|
| 2 | 8500 |
| 3 | 6479 |
| 1 | 5444 |
| 4 | 3117 |
| 5 | 2346 |
| 7 | 997 |
| 6 | 702 |
| 14 | 198 |
| 10 | 170 |
| 30 | 136 |
| 8 | 52 |
| 20 | 49 |
| 15 | 35 |
| 21 | 33 |
| 60 | 23 |
| 9 | 22 |
| 12 | 21 |
| 90 | 21 |
| 13 | 20 |
| 25 | 16 |
| 28 | 14 |
| 50 | 14 |
| 100 | 8 |
| 31 | 7 |
| 11 | 6 |
| 23 | 6 |
| 29 | 6 |
| 45 | 5 |
| 16 | 4 |
| 19 | 4 |
| 200 | 4 |
| 1000 | 4 |
| 17 | 3 |
| 40 | 3 |
| 70 | 3 |
| 80 | 3 |
| 120 | 3 |
| 18 | 2 |
| 22 | 2 |
| 24 | 2 |
| 26 | 2 |
| 27 | 2 |
| 49 | 2 |
| 85 | 2 |
| 160 | 2 |
| 180 | 2 |
| 360 | 2 |
| 365 | 2 |
| 999 | 2 |
| 34 | 1 |
| 35 | 1 |
| 39 | 1 |
| 42 | 1 |
| 43 | 1 |
| 48 | 1 |
| 56 | 1 |
| 59 | 1 |
| 61 | 1 |
| 66 | 1 |
| 75 | 1 |
| 89 | 1 |
| 92 | 1 |
| 102 | 1 |
| 107 | 1 |
| 150 | 1 |
| 270 | 1 |
| 300 | 1 |
| 430 | 1 |
| 500 | 1 |
| 1100 | 1 |
We find that, in descending order of commonality, the most common values for the variable minimum_nights are 2, 3, 1, 4 and 5. A value of 2, the most common value, stands out. We assume that one of the reasons for 2 minimum nights stay being the most common is the considerable decrease in host cleaning fee (expense), as compared with cleaning fee (expenses) for a 1 night stay. This means that the majority of hosts set a 2 night minimum in order to keep cleaning fees (and thus the overall price of the stay) down.
In order to substantiate this reasoning, we put cleaning fee into relation of the minimum nights stay:
listings_necessary %>%
select(minimum_nights, cleaning_fee) %>%
group_by(minimum_nights) %>%
summarise(mean = mean(cleaning_fee)) %>%
arrange(desc(mean)) %>%
kbl(align = 'l',col.names = c("Minimum Nights","Average Cleaning Fee")) %>%
kable_styling()
| Minimum Nights | Average Cleaning Fee |
|---|---|
| 24 | 1500.0 |
| 35 | 1250.0 |
| 89 | 1000.0 |
| 56 | 450.0 |
| 200 | 375.0 |
| 1000 | 317.0 |
| 300 | 300.0 |
| 270 | 298.0 |
| 430 | 250.0 |
| 160 | 249.0 |
| 28 | 247.9 |
| 30 | 243.3 |
| 999 | 243.0 |
| 7 | 232.7 |
| 70 | 231.7 |
| 1 | 226.4 |
| 5 | 221.7 |
| 90 | 221.3 |
| 50 | 220.0 |
| 6 | 213.9 |
| 4 | 213.7 |
| 3 | 209.0 |
| 11 | 208.2 |
| 8 | 205.3 |
| 34 | 200.0 |
| 39 | 200.0 |
| 100 | 200.0 |
| 120 | 200.0 |
| 2 | 196.3 |
| 31 | 195.6 |
| 14 | 193.7 |
| 20 | 190.9 |
| 15 | 181.1 |
| 45 | 180.0 |
| 21 | 173.4 |
| 13 | 172.1 |
| 25 | 159.3 |
| 10 | 159.0 |
| 12 | 153.0 |
| 85 | 150.0 |
| 150 | 150.0 |
| 29 | 141.7 |
| 60 | 139.9 |
| 19 | 137.5 |
| 9 | 117.2 |
| 17 | 116.7 |
| 80 | 116.7 |
| 16 | 100.0 |
| 22 | 100.0 |
| 42 | 100.0 |
| 40 | 89.7 |
| 23 | 53.2 |
| 1100 | 50.0 |
| 18 | 0.0 |
| 26 | 0.0 |
| 27 | 0.0 |
| 43 | 0.0 |
| 48 | 0.0 |
| 49 | 0.0 |
| 59 | 0.0 |
| 61 | 0.0 |
| 66 | 0.0 |
| 75 | 0.0 |
| 92 | 0.0 |
| 102 | 0.0 |
| 107 | 0.0 |
| 180 | 0.0 |
| 360 | 0.0 |
| 365 | 0.0 |
| 500 | 0.0 |
Above table substantiates our reasoning, since the average cleaning fee for listings with a minimum 1 night stay is considerably higher than that for listings with a minimum 2 night stay.
Excluding Listings with minimum_nights<=4
We exclude listings for minimum nights greater than 4, since our analysis requires us to calculate the price of a 4 night stay in Copenhagen for 2 guests.
listings_necessary <- listings_necessary %>%
filter(minimum_nights <= 4) # we use the less than, equal to sign in order to filter our selection
We now ensure that our data wrangling stages have been successful using the skim function:
#we check that our recoding, imputation and variable creation has been successful, and find that it has been
skim(listings_necessary)
| Name | listings_necessary |
| Number of rows | 23540 |
| Number of columns | 52 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 3 |
| factor | 7 |
| logical | 7 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 23540 | 0 |
| house_rules | 10980 | 0.53 | 1 | 1000 | 0 | 11313 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 295 | 0 |
| amenities | 0 | 1.00 | 2 | 1179 | 0 | 22096 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 6 | 1.00 | 2008-06-27 | 2020-06-26 | 2015-07-10 | 3281 |
| first_review | 3707 | 0.84 | 2009-09-04 | 2020-06-28 | 2017-08-20 | 2481 |
| last_review | 3707 | 0.84 | 2011-08-10 | 2020-06-28 | 2019-08-05 | 1769 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighbourhood | 0 | 1 | FALSE | 21 | Nør: 4386, Ves: 3568, Ind: 2955, Fre: 2707 |
| neighbourhood_cleansed | 0 | 1 | FALSE | 11 | Nrr: 4378, Ves: 3710, Ind: 3205, Fre: 2715 |
| property_type | 0 | 1 | FALSE | 27 | Apa: 19948, Con: 1422, Hou: 912, Tow: 400 |
| room_type | 0 | 1 | FALSE | 4 | Ent: 19568, Pri: 3860, Sha: 74, Hot: 38 |
| bed_type | 0 | 1 | FALSE | 5 | Rea: 23218, Pul: 180, Fut: 72, Cou: 38 |
| cancellation_policy | 0 | 1 | FALSE | 4 | fle: 10376, mod: 7444, str: 5439, sup: 281 |
| prop_type_simplified | 0 | 1 | FALSE | 5 | Apa: 19948, Con: 1422, Hou: 912, Oth: 858 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.11 | FAL: 20878, TRU: 2662 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 23482, FAL: 58 |
| host_identity_verified | 0 | 1 | 0.37 | FAL: 14835, TRU: 8705 |
| neighbourhood_group_cleansed | 23540 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.79 | TRU: 18574, FAL: 4966 |
| instant_bookable | 0 | 1 | 0.29 | FAL: 16722, TRU: 6818 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 23540 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_response_rate | 18798 | 0.20 | 90.22 | 26.11 | 0.00 | 100.00 | 100.00 | 100.00 | 100.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| host_acceptance_rate | 9766 | 0.59 | 74.56 | 31.60 | 0.00 | 53.00 | 89.00 | 100.00 | 100.0 | <U+2581><U+2581><U+2582><U+2582><U+2587> |
| host_listings_count | 0 | 1.00 | 6.33 | 37.65 | 0.00 | 1.00 | 1.00 | 1.00 | 737.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 672 | 0.97 | 2063.43 | 375.61 | 20.00 | 1824.00 | 2200.00 | 2300.00 | 4672.0 | <U+2581><U+2583><U+2587><U+2581><U+2581> |
| latitude | 0 | 1.00 | 55.68 | 0.02 | 55.62 | 55.67 | 55.68 | 55.70 | 55.7 | <U+2581><U+2583><U+2587><U+2587><U+2581> |
| longitude | 0 | 1.00 | 12.56 | 0.03 | 12.45 | 12.54 | 12.55 | 12.58 | 12.6 | <U+2581><U+2582><U+2587><U+2585><U+2582> |
| accommodates | 0 | 1.00 | 3.26 | 1.59 | 1.00 | 2.00 | 3.00 | 4.00 | 16.0 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms | 0 | 1.00 | 1.06 | 0.27 | 0.00 | 1.00 | 1.00 | 1.00 | 10.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 0 | 1.00 | 1.51 | 1.07 | 0.00 | 1.00 | 1.00 | 2.00 | 101.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 0 | 1.00 | 1.96 | 1.39 | 0.00 | 1.00 | 2.00 | 2.00 | 25.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| square_feet | 23245 | 0.01 | 709.13 | 554.79 | 0.00 | 140.00 | 753.00 | 1033.00 | 2691.0 | <U+2587><U+2587><U+2583><U+2581><U+2581> |
| price | 0 | 1.00 | 846.70 | 1034.46 | 0.00 | 498.00 | 698.00 | 996.00 | 69175.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| security_deposit | 0 | 1.00 | 728.89 | 2425.88 | 0.00 | 0.00 | 0.00 | 0.00 | 35000.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 0 | 1.00 | 209.05 | 247.85 | 0.00 | 0.00 | 150.00 | 300.00 | 4000.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.52 | 1.03 | 1.00 | 1.00 | 1.00 | 2.00 | 14.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 65.49 | 125.42 | 0.00 | 0.00 | 0.00 | 100.00 | 2024.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 2.31 | 0.97 | 1.00 | 2.00 | 2.00 | 3.00 | 4.0 | <U+2585><U+2587><U+2581><U+2586><U+2583> |
| number_of_reviews | 0 | 1.00 | 14.80 | 28.98 | 0.00 | 2.00 | 6.00 | 16.00 | 637.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 3.08 | 7.17 | 0.00 | 0.00 | 0.00 | 4.00 | 376.0 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 0 | 1.00 | 95.43 | 6.26 | 20.00 | 94.00 | 97.00 | 100.00 | 100.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 0 | 1.00 | 9.78 | 0.61 | 2.00 | 10.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 0 | 1.00 | 9.49 | 0.90 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 0 | 1.00 | 9.85 | 0.51 | 2.00 | 10.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 0 | 1.00 | 9.88 | 0.48 | 2.00 | 10.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 0 | 1.00 | 9.67 | 0.63 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 0 | 1.00 | 9.54 | 0.73 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| reviews_per_month | 0 | 1.00 | 0.50 | 0.72 | 0.01 | 0.17 | 0.28 | 0.56 | 30.6 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| accommodates_simplified | 0 | 1.00 | 3.22 | 1.46 | 1.00 | 2.00 | 3.00 | 4.00 | 7.0 | <U+2587><U+2582><U+2585><U+2581><U+2582> |
| bedrooms_per_capacity | 0 | 1.00 | 0.48 | 0.21 | 0.00 | 0.40 | 0.50 | 0.50 | 16.8 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds_per_capacity | 0 | 1.00 | 0.60 | 0.26 | 0.00 | 0.50 | 0.50 | 0.75 | 4.0 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bathrooms_per_capacity | 0 | 1.00 | 0.39 | 0.19 | 0.00 | 0.25 | 0.33 | 0.50 | 7.5 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
skim() shows us that there are 31 ‘numeric’ variables and 7 ‘factor’ variables. We summarise their names in tables in the subsequent sub-section.
We create three different lists, summarising with numeric, logical and factor variables, respectively.
Numeric Variables
#we isolate and tabulate numeric variables
num <- listings_necessary %>%
select_if(is.numeric) %>% #selecting numeric variables
select(-c("latitude", "longitude", "zipcode")) %>% #excluding irrelevant variables
na.omit()
list_num <- c(colnames(num)) #create a list of names of the variables
kbl(list_num, col.names="Numeric Variables") %>% #creating a table of the list
kable_styling()
| Numeric Variables |
|---|
| host_response_rate |
| host_acceptance_rate |
| host_listings_count |
| accommodates |
| bathrooms |
| bedrooms |
| beds |
| square_feet |
| price |
| security_deposit |
| cleaning_fee |
| guests_included |
| extra_people |
| minimum_nights |
| number_of_reviews |
| number_of_reviews_ltm |
| review_scores_rating |
| review_scores_accuracy |
| review_scores_cleanliness |
| review_scores_checkin |
| review_scores_communication |
| review_scores_location |
| review_scores_value |
| reviews_per_month |
| accommodates_simplified |
| bedrooms_per_capacity |
| beds_per_capacity |
| bathrooms_per_capacity |
Factor Variables
#we isolate and tabulate categorical (factor) variables
factor <- listings_necessary %>%
select_if(is.factor) #selecting categorical variables
list_factor <- c(colnames(factor)) #create a list of names of the variables
kbl(list_factor, col.names="Factor Variables") %>% #creating a table of the list
kable_styling()
| Factor Variables |
|---|
| neighbourhood |
| neighbourhood_cleansed |
| property_type |
| room_type |
| bed_type |
| cancellation_policy |
| prop_type_simplified |
Logical Variables
#we isolate and tabulate logical variables
logical <- listings_necessary %>%
select_if(is.logical) #selecting logical variables
list_logical <- c(colnames(logical)) #create a list of names of the variables
kbl(list_logical, col.names="Logical Variables") %>% #creating a table of the list
kable_styling()
| Logical Variables |
|---|
| host_is_superhost |
| host_has_profile_pic |
| host_identity_verified |
| neighbourhood_group_cleansed |
| is_location_exact |
| instant_bookable |
| is_business_travel_ready |
For the continuous variables we use density plots to look at the data. We group the continuous variables into four different groups: property-related, host-related, review-related and price-related variables.
Property-Related Variables
flat_longer <- num %>%
select("accommodates_simplified", "bedrooms_per_capacity", "beds_per_capacity", "bathrooms_per_capacity", "square_feet", "guests_included", "minimum_nights", "bedrooms"
) %>%
pivot_longer(names_to= "variable_name", values_to="values", everything() )
ggplot(flat_longer, aes(x=values), na.rm=TRUE)+
geom_density(fill="grey")+
facet_wrap(vars(variable_name), scales="free", ncol=4)+
labs(x="",title="Exploring Property-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuous variables", y="Density") +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
theme(axis.text.y=element_blank()) +
NULL

The distribution of the above selected variables is not unified, and we therefore take this into acccount as we build our models. We note specifically that Bedroom per capacity has most values around 0.5 and could therefore be less valuable than the total count of bedrooms.
Host-Related Variables
host_longer <- num %>%
select("host_response_rate", "host_acceptance_rate", "host_listings_count") %>%
pivot_longer(names_to= "variable_name", values_to="values", everything() )
ggplot(host_longer, aes(x=values), na.rm=TRUE)+
geom_density(fill="grey")+
facet_wrap(vars(variable_name), scales="free", ncol=4)+
labs(x="",title="Exploring Host-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuous variables", y="Density") +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
theme(axis.text.y=element_blank()) +
NULL
While the variable host_response_rate has most values around 100% and is therefore not interesting for further analysis, we are interested in evaluating host_acceptance_rate and host_listings_count.
Review-Related Variables
review_longer <- num %>%
select("number_of_reviews", "number_of_reviews_ltm", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "reviews_per_month") %>%
pivot_longer(names_to= "variable_name", values_to="values", everything())
ggplot(review_longer, aes(x=values), na.rm=TRUE)+
geom_density(fill="grey")+
facet_wrap(vars(variable_name), scales="free", ncol=4)+
labs(x="",title="Exploring Review-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuous variables", y="Density") +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
theme(axis.text.y=element_blank()) +
NULL

Although most of the variables have heavily skewed distributions and tightly clustered observations which makes drawing inference challenging, we may distinguish an interesting distribution of review_scores_rating. At this stage, we speculate that review variables may account for significant differences between listing prices, and thus be a likely determinant of the price of a stay in Copenhahgen.
Price-Related Variables
price_longer <- num %>%
select("price", "security_deposit", "cleaning_fee", "extra_people") %>%
pivot_longer(names_to= "variable_name", values_to="values", everything() )
ggplot(price_longer, aes(x=values), na.rm=TRUE)+
geom_density(fill="grey")+
facet_wrap(vars(variable_name), scales="free", ncol=4)+
labs(x="",title="Exploring Price-Related Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Density plot of selected continuoUs variables", y="Density") +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
theme(axis.text.y=element_blank()) +
scale_x_continuous(labels=label_dollar())+
NULL
Overall, our goal is to explain and be able to predict the prices of renting AirBnB properties in Copenhagen. The price will be calculated by summarizing the price, cleaning fee and any additional fees for extra people not included in the base rate of the listing. Consequently, it is important that we know how the price components are distributed. In particular, a price-related variable that draws attention is cleaning_fee, whose distribution indicates rather high variability. This may indeed be an influential factor in further analysis.
For the logical values we create a table counting FALSE and TRUE for each logical variable:
#we tidy logical variables
logical_values_clean <- listings_necessary %>%
select_if(is.logical) %>% #we check for logical values
select(-c("neighbourhood_group_cleansed")) %>% #we do not want to include neighbourhood_group_cleansed, since it is also a logical variable
pivot_longer(names_to = "type", values_to ="values", cols=1:6) %>% #we pivot the table in order to have it cleansed
group_by(type, values) %>%
summarise(count= n()) %>%
pivot_wider(names_from = "values", values_from ="count")
logical_values_clean %>%
mutate(type = case_when(
type =="host_has_profile_pic" ~ "Host has profile picture",
type =="host_identity_verified" ~ "Host identity is verified",
type =="host_is_superhost" ~ "Host is a superhost",
type =="is_business_travel_ready" ~ "Flat is business-travel ready",
type =="is_location_exact" ~ "Location is exact",
type =="instant_bookable" ~ "Flat is instantly bookable"
)) %>%
kbl(col.names=c("Logical Variable", "False Count","True Count" ), caption="Count of True and False for every Logical Variable") %>%
kable_styling() # we want a nice style for our table
| Logical Variable | False Count | True Count |
|---|---|---|
| Host has profile picture | 58 | 23482 |
| Host identity is verified | 14835 | 8705 |
| Host is a superhost | 20878 | 2662 |
| Flat is instantly bookable | 16722 | 6818 |
| Flat is business-travel ready | 23540 | NA |
| Location is exact | 4966 | 18574 |
The count of true values in “Flat is business travel ready” is NA because every value is FALSE.
As for this step of the analysis we may observe that logical variables such as: host_has_profile_pic, is_business_travel_ready and host_is_superhost would probably not account for much of variability in price (the outcome variable of our analyses), since most of their corresponding values are rather homogeneous. Contrastingly, due to their lack of homogeneity, it seems that host’s identity, ability to instant book the flat and exact location of flat may account for variability in prices to a greater degree.
Now, we create barcharts to evaluate our categorical variables:
factor_longer <- factor %>% #We tidy the data into a neat format
select(-property_type) %>%
pivot_longer(names_to= "variable_name", values_to="values", everything() ) %>%
group_by(variable_name,values) %>% #We precount variables in order to sort them
mutate(count_name_occurr = n() )
ggplot(factor_longer, aes(y=reorder(values,count_name_occurr) , na.rm=TRUE))+ # We create a sorted barplot with variables
geom_bar(fill="grey", orientation ="y")+
facet_wrap(vars(variable_name), scales="free", ncol=3)+
labs(title="Exploring the Distribution of Categorical Variables for Investigation of Copenhagen AirBnB Prices",subtitle="Faceted bar chart of selected categorical variables")+
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
NULL

We may observe that the categorical variable bed_type will probably not account for much variability in price, since most of its corresponding values are homogeneous. However, consistent with this rationale, it seems that neighborhood, room_type, prop_type_simplified and cancellation_policy may account for variability in prices to a greater extent, since they are more evenly distributed and thus less homogeneous.
We inspect the numeric variables for correlations among them. Therefore, we create a correlation matrix and visualize it with the function corrplot(). We excluded some variables due to above mentioned reasoning, and also excluded all review-related variables, as we will look at them separately.
correlation_matrix <- num %>%
select(-c(number_of_reviews_ltm, bedrooms_per_capacity, beds_per_capacity, reviews_per_month, accommodates_simplified, host_response_rate, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, reviews_per_month)) %>%
cor()
corrplot(correlation_matrix,
method="color",
type="full",
title="Correlation matrix of selected numerical variables")
In the correlation plot above, bluer shades represent stronger positive correlations, and redder shades reflect stronger negative correlations. We observe that much of the correlation appears within specific groups of variables (e.g. listing-type related variables or review related variables), which is not surprising. Correlation between variables such as bedrooms, beds, accommodates and price was also anticipated, since more of any of these things implies (in most cases) more of the others.
Review-Related Variables
Because all the review-related variables could be highly correlated, we look at them in detail:
listings_necessary %>%
select(review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, reviews_per_month) %>%
ggpairs(aes(alpha=0.1))+
theme_bw()
As expected, the correlations among the review-related variables are very high, but none of the values is higher than 0.67 so we do not need to worry about them.
Listing-Related Variables
Now, we evaluate whether listing-related variables correlate:
listings_necessary %>%
select(square_feet, accommodates_simplified, bedrooms, beds, bathrooms_per_capacity, prop_type_simplified) %>%
ggpairs(aes(alpha=0.1, color= prop_type_simplified))+
theme_bw()
Here we can see that some listing-related variables do correlate strongly. Beds and Bedrooms, for example, exceed 0.7 for House listings and other property types. It is also highly noticeable that property types bundled as “Other” correlate more strongly with most other variables than the other clearly-defined property types. In order to avoid problematic multicollinearity, we need to take these relationships into consideration when building our models.
A Selection of Interesting Variables
Having evaluated the key correlations between both Review-Related and Listing-Related variables, we now evaluate a mix of interesting variables and see if they correlate with one another, in an effort to glean a better understanding of which variable combinations may cause problematic multicollinearity in our final model:
listings_necessary %>%
select(number_of_reviews, review_scores_rating, bathrooms, accommodates, host_is_superhost) %>%
GGally::ggpairs(aes(alpha = 0.4, colour=host_is_superhost),
) +
theme_bw()
Reassuringly, we detect no strong correlation between these values, which encourages us to employ them together in a model to predict the price of a 4 night stay in an Copenhagen AirBnB for 2 guests.
#we set colors for price
price_color <- colorNumeric(palette = "Reds",
domain = c(1:2000),
reverse = FALSE)
#we initiate the leaflet map
map <-leaflet(data = listings_necessary) %>%
# Map with OpenStreetMap.Mapnik
addProviderTiles("OpenStreetMap.Mapnik") %>%
# Map centered on the coordinate of Copenhagen with zoom of 11
setView(lat = 55.6761, lng = 12.5683, zoom = 11) %>%
# Add circle markers with popups and labels
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~price_color(price),
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type)
#print map
map
As would be expected, we find that AirBnBs are most densely clustered in the center of the city, and are highest priced in the most popular tourist spots, including Gammel Strand, Norreport and Kongens Nytorv. The most dense clusterings can be found around the lakes in Norrebro, particularly on the North West side, which is less green than the Southern bank. Interestingly, both the sea bank to the west and also the worse connected area of Brygge also command high Airbnb prices, especially relative to areas such as Fredriks Brygge to the North West. In fact it is clear that the further North or South from the city center in Kongens Nytorv and Nyhavn, the lower priced AirBnBs are on average. This holds to a greater extent further from the coastline - despite being relatively far from the center, Amager still commands relatively high Airbnb prices.
However, this map has a huge amount of datapoints. Though this is not problematic for identifying which areas have expensive or cheap AirBnBs, which makes it hard to identify which areas have the highest density of AirBnBs. Therefore, we cluster the data points for ease of interpretation.
map_cluster <-leaflet(data = listings_necessary) %>%
# Map with OpenStreetMap.Mapnik
addProviderTiles("OpenStreetMap.Mapnik") %>%
# Map centered on the coordinate of Copenhagen with zoom of 11
setView(lat = 55.6761, lng = 12.5683, zoom = 11) %>%
# Add circle markers with popups and labels
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type,
# cluster
clusterOptions = markerClusterOptions())
#print map_cluster
map_cluster
This map demonstrates how central Copenhagen has over 17000 separate AirBnB listings, over 5x more than Amager, the region with the next highest number of listings. As would be expected, the area we identified as commanding the highest Airbnb prices, the popular tourist areas of Norreport, Kongens Nytorv and Nyhavn, has the highest number (5000) of listings. Further, Norrebro and Southern Osterbro, i.e. the lakes of the city, have almost 3000 Airbnb listings in a relatively small area (especially since much of it is occupied by water). As we saw in the earlier map, there are also extensive listings in Norrebro, despite commanding on average a lower price than the very center of the city.
Now we know where the most expensive AirBnBs are, and the neighborhoods in which most AirBnBs are clustered, we are interested in knowing whether there is a geographical pattern to the type of AirBnB’s available in Copenhagen. Given the high price of AirBnBs in the city center, perhaps we would expect more Townhouses and Houses, but perhaps the price of the listings is just a function of their location, and therefore the opposite may be true - all the expensive inner-city listings may only be Apartments or Condominiums. Since these are mere speculations, we now investigate potential relationships further.
# group by prop_type_simplified
apartment <- listings_necessary %>%
filter(prop_type_simplified == 'Apartment')
condominium <- listings_necessary %>%
filter(prop_type_simplified == 'Condominium')
house <- listings_necessary %>%
filter(prop_type_simplified == 'House')
townhouse <- listings_necessary %>%
filter(prop_type_simplified == 'Townhouse')
# set colors for price of different property types
apartment_color <- colorNumeric(palette = "Reds",
domain = c(1:2000),
reverse = FALSE)
condominium_color <- colorNumeric(palette = "Blues",
domain = c(1:2000),
reverse = FALSE)
house_color <- colorNumeric(palette = "Greens",
domain = c(1:2000),
reverse = FALSE)
townhouse_color <- colorNumeric(palette = "Yellows",
domain = c(1:2000),
reverse = FALSE)
# initiate leaflet map
map_prop_type <-leaflet() %>%
# Map with OpenStreetMap.Mapnik
addProviderTiles("OpenStreetMap.Mapnik") %>%
# Add circle markers with popups and labels for apartment
addCircleMarkers(data = apartment,
group = "apartment",
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~apartment_color(price),
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type) %>%
# Add circle markers with popups and labels for condominium
addCircleMarkers(data = condominium,
group = "condominium",
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~condominium_color(price),
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type) %>%
# Add circle markers with popups and labels for house
addCircleMarkers(data = house,
group = "house",
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~house_color(price),
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type) %>%
# Add circle markers with popups and labels for townhouse
addCircleMarkers(data = townhouse,
group = "townhouse",
lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~townhouse_color(price),
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type) %>%
# set layers by property type
addLayersControl(overlayGroups = c("apartment", "condominium","house","townhouse")) %>%
# Map centered on the coordinate of Copenhagen with zoom of 11
setView(lat = 55.6761, lng = 12.5683, zoom = 11)
# print map_prop_type
map_prop_type
This map allows us to identify some patterns in the distribution of certain types of AirBnB listings across Copenhagen. If we reduce the map down (using the layers in the top right hand corner) to just houses, we find that most AirBnB listings for houses are in Bronshoj, Uttersley and Southern Amager, close to Kastrup. Generally, the same is also true for Townhouses, which are scattered largely in Amager, Fredriksberg and Bronshoj. However, Townhouses also seem to be clustered (into small clusters of 5-10 listings) on the coastline in the North East of the city, and also on the Southern side of the lakes, towards their Eastern limit on the way out of the city center and towards Osterbro, which is far more residential. As we anaklyse Condominiums, we find the price pattern we observed earlier becomes clearer - the more central the AirBnB, the higher the price. Though there are AirBnB condominiums across the whole city, these do seem largely clustered around the lakes, Christianshavn, and Vesterbro, with the most expensive listings being in Kongens Nytorv and Fredriksberg. This mapping also allows us to see how Apartments are by far the most common listing of all in Copenhagen, with more Apartment listings than all other types combined, and that in general the price of AirBnB listings which are apartments falls sharply on leaving the main waterways of the city center.
Most evidently, by separating out listing types in this way, we see how geographical heterogeneity in price differs between listing types. Specifically, we find that while Apartment and Condominium prices decline as the distance to the city center increases, House and Townhouse prices are not overtly correlated with centrality in the same way - some of the most expensive House and Townhouse listings being on the Eastern coastline, the Northern lakes, and way out of the city in Valby, Vanlose and to the North of Fredriksberg.
From these maps, we can conclude that:
the vast majority of listings, as we would expect, are in the city centre and in key areas (highlighted in our cluster map) around the center such as Norrebro, Osterbro, Vesterbro and Fredriksberg
there are key differences in the number of AirBnB listings of certain property types, and the most common property type by far is an Apartment
without separating listings by property type, we find that on average listings are more highly priced in central areas, especially on the inland waterways and Eastern coastline, such as the lakes and Amager
if we discriminate by property type, we find there is heterogeneity in the relationship between centrality and price across property types, and that while Apartment and Condominium listings are most expensive in the city center, and decrease in price with distance from tourist hot-spots, House and Townhouse exhibit no such relationship and are generally clustered in similar neighborhoods
While these are interesting conclusions, the presence of expensive listings of all property types away from the city center, and in less desirable, traditionally residential areas which require use of trains and other public transport to access city sights demonstrates that location (or rather, proximity to city center) cannot be the only predictor of AirBnB listing price. Further, we have only analysed the base listing price so far, and not the actual cost to the AirBnB guest, which includes additional guest fees and cleaning fees.
Therefore, we run a number of regression models on test and training data to glean a greater insight into which factors, besides geography, are associated with the cost of an Airbnb stay in Copenhagen. In order to achieve this, we need to evaluate not only the base list price of Copenhagen AirBnBs, but use a forecast price as our outcome variable. Therefore, we simulate a popular scenario: a 4-night visit to Copenhagen for 2. Using this assumed “trip”, we standardise the prices of all listings to the price that would be paid by 2 visitors staying at the listing, under this scenario.
Creating the ‘price_4_nights’ variable
In order to run our regression analyses and better understand factors associated with listing prices in Copenhagen, we create a new variable, ‘price_4_nights’, which calculates the total cost for two people to stay at each Airbnb property for 4 nights. We apply the following formula to define the variable:
i) if the number of guests included with the base nightly rate is greater than or equal to 2 (i.e. 2 people would only need to pay the base nightly rate), then we calculate the rate for 4 nights by multiplying the base rate by 4 and adding the cleaning fee.
ii) if the number of guests included with the base nightly rate is less than 2 (i.e., it is 1), then we calculate the nightly rate by adding the cost for an additional guest (the second guest) to the base rate, multiplying this sum by 4 and then adding the cleaning fee.
#we create the price_4_nights variable using the case_when() function
listings_necessary <-
listings_necessary %>%
mutate(
#our conditional statement says that if the number of guests included in the price is greater than or equal to 2, the price is just the cleaning fee plus 4* the nightly rate, since the 2 people will stay for 4 nights
price_4_nights = case_when(
#however, if the number of guests included in the price is less than 2, meaning that a party of 2 would have to pay for exactly 1 extra guest, then the price will be 4* the nightly price, which is the base price plus the price for that extra guest, plus the cleaning fee
guests_included >= 2 ~ cleaning_fee + 4*price,
TRUE ~ 4 * (price + extra_people) +cleaning_fee
)
)
#when we run skim() to evaluate our new variable, we also find that the minimum value is 0, which will cause errors when log transformed
skim(listings_necessary$price_4_nights)
| Name | listings_necessary$price_… |
| Number of rows | 23540 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 3657 | 4194 | 0 | 2204 | 3092 | 4193 | 276700 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
#therefore, we examine how many observations have a valuje of 0 for the price_4_nights variable
summary((listings_necessary$price_4_nights)==0)
## Mode FALSE TRUE
## logical 23539 1
#we find 1 observation has a value of 0 for this variable, and subsequently filter it out
listings_necessary <- listings_necessary %>%
filter(price_4_nights != 0)
#we ensure the erroneous price_4_nights 0 value has been removed from the working dataset, and find that it has been
summary((listings_necessary$price_4_nights)==0)
## Mode FALSE
## logical 23539
To glean a preliminary understanding of how location is related to price_4_nights, we plot a map of the price_4_nights variable
# set colors for price
price_color <- colorNumeric(palette = "Reds",
domain = c(1:10000),
reverse = FALSE)
# initiate leaflet map
stay_price_map <-leaflet(data = listings_necessary) %>%
# Map with OpenStreetMap.Mapnik
addProviderTiles("OpenStreetMap.Mapnik") %>%
# Map centered on the coordinate of Copenhagen with zoom of 11
setView(lat = 55.6761, lng = 12.5683, zoom = 11) %>%
# Add circle markers with popups and labels
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~price_color(price_4_nights),
popup = ~listing_url,
label = ~property_type)
#print map
stay_price_map
As we would expect, we find that the highest priced AirBnB stays are located in the very center of the city. However, interestingly, the 4 nights for 2 people variable seems to show that even in areas of Osterbro, which are relatively close to the city center, prices are not as high (relative to the other listings) as when we evaluate the distribution of nightly rates. This implies that listings which seemed relatively expensive slightly further from the city center on a nightly basis are actually relatively cheaper on a 4-nightly basis for 2 people, perhaps because they have lower cleaning fees (which we did not account for in the earlier mappings) or include more guests without the need for additional fees.
Though this visualisation is informative, it does not allow us to gain a deep understanding of the nature of our outcome variable, its distribution, or whether it should be transformed in order to generate valid and accurate regressions. Therefore, we analyse the distributions of our outcome variable, and the log of our outcome variable, to evaluate its nature and whether transformation is necessary.
Now, we analyse the distributions of both the ‘price_4_nights’ and log(price_4_nights) variables
#we calculate and report basic summary statistics for the average price for 2 people for 4 nights for Airbnb listings in Copenhagen
summary_price4nights <- summarise(listings_necessary, mean=mean(price_4_nights), median=median(price_4_nights), min=min(price_4_nights), max=max(price_4_nights))
summary_price4nights
#we summarise the distribution of price_4_nights without taking it's log
options(scipen=1000)
no_log_dist <- ggplot(listings_necessary) + geom_histogram(aes(price_4_nights), bins=80) + scale_x_continuous(labels = scales::dollar, breaks=seq(0, 20000, by = 1000), limits=c(0,20000)) + labs(x="Price: 2 Guests for 4 Nights", y="Count",title="The Distribution of Copenhagen Airbnb Stay Prices is Heavily Positively Skewed",subtitle="Distribution of prices for a 4-night Airbnb stay for a party of 2 in Copenhagen") + theme_fivethirtyeight() + theme(axis.title = element_text()) + geom_vline(xintercept=summary_price4nights$mean, size=0.5, color="red") + geom_vline(xintercept=summary_price4nights$median, size=0.5, color="blue") + annotate(geom = 'text', label = "Sample Mean", x = summary_price4nights$mean+70, y = 800, angle=270, colour="white") + annotate(geom = 'text', label = "Sample Median", x = summary_price4nights$median+70, y = 700, angle=270, colour="white")
#we now re-summarise the distribution of price_4_nights, having taken its log
#we calculate and report basic summary statistics for the average price for 2 people for 4 nights for Airbnb listings in Copenhagen
summary_log_price4nights <- summarise(listings_necessary, mean=mean(log(price_4_nights)), median=median(log(price_4_nights)), min=min(log(price_4_nights)), max=max(log(price_4_nights)))
summary_log_price4nights
#we summarise the distribution of log_price_4_nights without taking it's log
options(scipen=1000)
log_dist <- ggplot(listings_necessary) + geom_histogram(aes(log(price_4_nights)), bins=80) + scale_x_continuous(labels = scales::dollar, breaks=seq(0,13, by = 1), limits=c(0,13)) + labs(x="Price: 2 Guests for 4 Nights", y="Count",title="The Logged Distribution of Copenhagen Airbnb Stay Prices is Quasi-Normal",subtitle="Distribution of log prices for a 4-night Airbnb stay for a party of 2 in Copenhagen") + theme_fivethirtyeight() + theme(axis.title = element_text()) + annotate(geom = 'text', label = "Sample Median", x = summary_log_price4nights$median+0.15, y = 1700, angle=270, colour="white") + geom_vline(xintercept=summary_log_price4nights$median, size=0.5, color="blue")
#we plot the non-logged distribution alongside the logged distribution for ease of comparison
no_log_dist + log_dist

Note that in the un-logged distribution of price_4_nights above, we limited the histogram to a maximum price of $20,000 (when the actual maximum is almost $280,000). We added this limit because we believe that the substantial skew of the distribution is more evident when the range of the data is limited in the x-axis, and because the presence of extreme outliers is sufficiently demonstrated by values at $20,000, meaning that plotting extreme outliers at $280,000 is not of analytical importance. Contrastingly, the logged histogram is quasi-normally distributed, with no extreme outliers.
Subsequently, we choose to use the log(price_4_nights) variable in all our analyses
These plots demonstrate two key issues with the un-logged price data: firstly, it is positively skewed, with a broad spread of observations at the upper end of the distribution; and secondly, there are a number of very extreme outliers at the upper end of the distribution, reaching up to $280,000, with only those up to $20,000 shown in the distribution above.
Though normal distribution is not itself an assumption of linear regression (it is only necessary that Y, price_4_nights is normally distributed at each value of X, i.e. there is no heteroskedasticity), a normally distributed Y is favourable, since it is more likely to meet this assumption across many models. Therefore, in finding that log(price_4_nights) is distributed log normally, we favour this transformed variable in our subsequent regressions.
Since adjusted R squared can be misleading and lead to overfitting, we evaluate out-of-sample goodness of fit. This requires us to randomly partition the data into a training dataset containing 75% of the data, which is used to fit the model, and 25% testing data, used to assess model performance.
Throughout our model selection process, we build and interpret our models using all the data, before using our training data and testing data to evaluate overfitting via the Root Mean Squared Error (RMSE) and Adjusted R-Squared (R2).
set.seed(1234)
# create train_test_split, 75% for training 25% for testing
train_test_split <- initial_split(listings_necessary, prop = 0.75)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)
For ease of interpretation, we use the term ‘stay’ to describe a 4-night stay in a Copenhagen Airbnb for 2 people. Therefore, the variable price_4_nights is the price for our ‘stay’.
Further, we apply a conventional 0.05 alpha level throughout, meaning that our coefficients are deemed statistically significant if their p-value is less than 0.05. When we say that a coefficient is statistically significant at all conventional significance levels, it is effectively 0, since conventional significance levels refers to alpha levels of 0.05, 0.01, 0.001 and 0.0001. Since the p-value is lower than all of these levels, it is deemed robustly statistically significant.
Next, we fit a regression model (model1) which takes log(price_4_nights) as its outcome variable, and 3 explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating
#we run a standard linear regression model with a selection of base variables
model1 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = listings_train)
#we report the output of the linear regression model, with a 95% confidence interval, as a neat table
model1 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 1708.02 | 1.06 | 127.20 | 0.000 | 1522.93 | 1915.60 |
| prop_type_simplifiedCondominium | 1.08 | 1.02 | 4.56 | 0.000 | 1.04 | 1.11 |
| prop_type_simplifiedHouse | 1.29 | 1.02 | 12.62 | 0.000 | 1.24 | 1.34 |
| prop_type_simplifiedOthers | 1.60 | 1.02 | 22.80 | 0.000 | 1.53 | 1.66 |
| prop_type_simplifiedTownhouse | 1.42 | 1.03 | 11.75 | 0.000 | 1.34 | 1.50 |
| number_of_reviews | 1.00 | 1.00 | 1.15 | 0.251 | 1.00 | 1.00 |
| review_scores_rating | 1.01 | 1.00 | 9.56 | 0.000 | 1.00 | 1.01 |
#we evaluate initial model performance on the full dataset, specifically in terms of adjusted R2, and print this in a neat table
model1 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R Squared","Adjusted R Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R Squared | Adjusted R Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.047 | 0.047 | 0.507 | 146 | 0 | 6 |
A Note on Interpreting our Anti-Logged Coefficients
Since our Y (outcome variable) is logged, we take the exponent, exp(), of the estimate to aid interpretibility. Specifically, this exp() coefficient tells us the ratio of the expected geometric mean price_4_nights for the coefficient group, versus the base group (for factor variables), or the ratio of the expected geometric mean for price_4_nights for each additional point increase in the explanatory variable (for numeric variables), ceteris paribus.
prop_type_simplified
Therefore, we can interpret these coefficients as percentage changes: a coefficient of 1.07 for Condominium implying that, relative to the base property type, Apartment, Condominium AirBnB listings are, on average, associated with a 7% higher price for our ‘stay’, ceteris paribus. In Model 1, we find that all levels of prop_type_simplified are statistically significant, at all conventional significance levels. Type “Others”, which includes listings which are not Apartments, Condominiums, Houses or Townhouses, is associated most strongly with ‘stay’ price: a ‘stay’ at an “Other” type of property is associated with a price 60% higher, on average, than an Apartment ‘stay’. Similarly, ceteris paribus, Townhouse ‘stays’ are associated with a 42% higher price, and House ‘stays’ are associated with a 29% higher price, than an Apartment Stay on average.
review_scores_rating
We find that each additional point rating in review score is associated with a 0.5% increase in the price of our ‘stay’, ceteris paribus. This estimate is also statistically significant at all conventional significance levels, and thus that, on average, stays at listings with higher ratings are more expensive than those at lower rated listings. Though this seems like a low coefficient, inevitably a 10-point difference in rating is associated with a relatively substantial 5% price increase - and thus the effect is cumulative and is likely to be substantive for most Airbnb guests.
Model 1 Performance
On evaluating the performance of our first model, we find that its adjusted r squared of 4.7% is extremely low. We use the adjusted R squared because this is far less likelto to be upwardly biased by the addition of new variables: it only increases if the new predictor enhances our model beyond what would be obtained by probability.
Explaining less than 5% of variation in our price_4_nights outcome variable, it is evident that these predictors are insufficient to allow effective prediction of the price of our stay. Therefore, we add another variable, room_type, to see if we can increase the performance of our model.
Model 1 Diagnostics
#we evaluate whether the 4 core assumptions of linear regression (Linear, Independence, Normal, Equality) hold
autoplot(model1, alpha = 0.3, label.size = 3) +
theme_bw()

We find that the residuals are patterned, indicating that there is a pattern (or multiple patterns) in our data which is currently not accounted for by the model itself. Further, we find that the residuals follow a quasi-normal distribution, which indicates our linear regression normality assumption is tenable. In terms of scale-location, the equal (or constant) variance assumption appears to hold, since there is no positive or negative trends across the fitted values. Finally, we identify quite a few observations for Apartment listings with high absolute residuals, indicating that these observations may have undue influence on estimates of model parameters.
#we use the Variance Inflation Factor (VIF) to evaluate if multicollinearity is problematic for our model
kbl(vif(model1), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1 | 4 | 1 |
| number_of_reviews | 1 | 1 | 1 |
| review_scores_rating | 1 | 1 | 1 |
Since all VIF shown are less than 5, we can safely conclude that this model does not have collinearity problems, and no variables require removal.
#we run the model on training data and construct both a predictions and RMSE output
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model1, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#we run the model on test data and construct both a predictions and RMSE output
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model1, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#we pull the R2 value from the training model
r2_train <-
summary(model1)$r.squared
#we pull the R2 value from the testing model
r2_test <-
cor(predict(model1, listings_test), listings_test$price_4_nights)
#we summarise the 4 values we have created or extracted in a neat table for evaluation
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-squared"))),
) %>%
kable_styling()
| RMSE | R-squared | |
|---|---|---|
| train | 64597 | 0.047 |
| test | 51236 | 0.094 |
Strangely, the RMSE (Root Mean Squared Error) and R2 is considerably higher for the training dataset than the testing dataset, with the R2 of 9.6% for test data being over twice as high as that for training data. Regardless of the fact that the model appears to fit the out-of-sample testing data better than the training data, this considerable difference across RMSE and R2 suggests that the model may suffer from overfitting.
However, preliminary research suggests that the lower test error indicates that the model generalises well, but that the training set is considerably larger than the test set, as is the case in our analyses.
As we are also interested in determining if room_type is a significant predictor of the cost of the 4 nights (Y), ceteris paribus, we fit another regression model which includes the same explanatory variables as model1, plus the room_type variable
#see Model1 for detailed annotation of regression model creation and diagnostic stages
model2 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = listings_necessary)
model2 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 2308.795 | 1.05 | 169.69 | 0.000 | 2111.231 | 2524.847 |
| prop_type_simplifiedCondominium | 1.036 | 1.01 | 2.82 | 0.005 | 1.011 | 1.062 |
| prop_type_simplifiedHouse | 1.316 | 1.02 | 17.81 | 0.000 | 1.277 | 1.357 |
| prop_type_simplifiedOthers | 1.566 | 1.02 | 27.65 | 0.000 | 1.517 | 1.617 |
| prop_type_simplifiedTownhouse | 1.431 | 1.02 | 15.57 | 0.000 | 1.368 | 1.497 |
| number_of_reviews | 1.001 | 1.00 | 10.71 | 0.000 | 1.001 | 1.001 |
| review_scores_rating | 1.004 | 1.00 | 7.69 | 0.000 | 1.003 | 1.005 |
| room_typeHotel room | 1.309 | 1.08 | 3.56 | 0.000 | 1.129 | 1.518 |
| room_typePrivate room | 0.539 | 1.01 | -76.22 | 0.000 | 0.531 | 0.548 |
| room_typeShared room | 0.410 | 1.05 | -16.78 | 0.000 | 0.369 | 0.455 |
model2 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.242 | 0.241 | 0.455 | 833 | 0 | 9 |
room_type
Applying the same interpretation methodology as for model1, we find that (ceteris paribus) relative to listings for an entire home or apartment, stays in Private Rooms or Shared Rooms are associated on average with a 46%, or 59% lower stay price, respectively. Contrastingly, stays in hotel rooms are associated, on average, with a 31% higher stay price relative to listings for an entire home or apartment. All three of these coefficients are statistically significant at all conventional significance levels.
Model 2 Performance
We find that the adjusted R-squared of our second model is considerably higher than that of our first (4.6%), at 24%. This means that the addition of room_type has substantially increased the predictive power of our model, since it now explains a quarter of variance in the price of our stay. In order to try to improve the predictive power of our model, we add explanatory variables, including variables for the number of bathrooms, bedrooms, and beds, and house size.
Model 2 Diagnostics
autoplot(model2, alpha = 0.3, label.size = 3) +
theme_bw()

Much like in Model 1, in Model 2 we find that the residuals are patterned, indicating that there is a pattern (or multiple patterns) in our data which remains unaccounted for by the model itself. Again, we also find that though the residuals follow approximately a normal distribution. In terms of scale-location, the equal (or constant) variance assumption seems to hold less for Model 2 than Model 1, since at the upper and lower limits of fitted values there seems to be a negative, followed by a more extreme positive kink in the standardised residuals. Further, we identify quite a few observations with high absolute residuals, again indicating that these observations may have undue influence on estimates of model parameters.
kbl(vif(model2), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1.05 | 4 | 1.01 |
| number_of_reviews | 1.02 | 1 | 1.01 |
| review_scores_rating | 1.01 | 1 | 1.00 |
| room_type | 1.07 | 3 | 1.01 |
Again, in Model 2, all VIF shown are less than 5, and thus we can safely conclude that this model does not have collinearity problems, and hence no variables require removal.
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model2, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model2, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(model2)$r.squared
r2_test <-
cor(predict(model2, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-squared"))),
) %>%
kable_styling()
| RMSE | R-squared | |
|---|---|---|
| train | 54463 | 0.242 |
| test | 44868 | 0.135 |
On comparison of the RMSE and R2 of our second model, we find that, again, the error associated with our training data is greater than that of the model on test data, which may be an artifact of small testing sample and large training sample. Despite this, the substantial difference in RMSE and also the fact R2 for the model on training data (24.2%) is almost double that of the R2 on test data (13.5%) suggests that the model suffers from overfitting.
Now we explore whether bathrooms, bedrooms, beds and house size (accommodates) are significant predictors of the price of our stay (price_4_nights)
#see Model1 for detailed annotation of regression model creation and diagnostic stages
model3 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
beds +
accommodates, data = listings_necessary)
model3 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 1174.003 | 1.04 | 167.787 | 0.000 | 1080.960 | 1275.054 |
| prop_type_simplifiedCondominium | 1.026 | 1.01 | 2.262 | 0.024 | 1.003 | 1.048 |
| prop_type_simplifiedHouse | 0.921 | 1.01 | -5.584 | 0.000 | 0.895 | 0.948 |
| prop_type_simplifiedOthers | 1.359 | 1.01 | 21.055 | 0.000 | 1.321 | 1.399 |
| prop_type_simplifiedTownhouse | 1.018 | 1.02 | 0.832 | 0.406 | 0.976 | 1.061 |
| number_of_reviews | 1.001 | 1.00 | 9.514 | 0.000 | 1.001 | 1.001 |
| review_scores_rating | 1.004 | 1.00 | 10.200 | 0.000 | 1.004 | 1.005 |
| room_typeHotel room | 1.596 | 1.07 | 6.930 | 0.000 | 1.398 | 1.822 |
| room_typePrivate room | 0.666 | 1.01 | -51.947 | 0.000 | 0.656 | 0.677 |
| room_typeShared room | 0.527 | 1.05 | -13.474 | 0.000 | 0.481 | 0.579 |
| bathrooms | 1.165 | 1.01 | 13.907 | 0.000 | 1.140 | 1.191 |
| bedrooms | 1.046 | 1.00 | 13.750 | 0.000 | 1.039 | 1.053 |
| beds | 1.008 | 1.00 | 2.690 | 0.007 | 1.002 | 1.014 |
| accommodates | 1.117 | 1.00 | 38.574 | 0.000 | 1.110 | 1.123 |
model3 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.398 | 0.398 | 0.406 | 1199 | 0 | 13 |
We find that each of the explanatory variables added (bathrooms, bedrooms, beds and accommodates) have statistically significant coefficients at all conventional significance levels. This means that each is a statistically significant predictor of our outcome variable, the price of a 4 night stay for 2 guests in Copenhagen.
Here, we offer a brief analysis of the coefficients on each variable.
bathrooms
The coefficient on bathrooms demonstrates that ceteris paribus, each additional bathroom in an Airbnb is associated with, on average, a 16.5% higher stay price.
bedrooms
The coefficient on bedrooms demonstrates that ceteris paribus, each additional bedroom in an Airbnb is associated with, on average, a 4.6% higher stay price.
beds
The coefficient on beds demonstrates that ceteris paribus, each additional bedroom in an Airbnb is associated with, on average, a 0.8% higher stay price.
accommodates
The coefficient on bedrooms demonstrates that ceteris paribus, each additional bedroom in an Airbnb is associated with, on average, a 11.7% higher stay price.
Model 3 Performance
Again, we find that the addition of these variables has increased the adjusted R-squared of our model. Our model now explains approximately 40% of variation in the price of our stay, which despite being suboptimal for accurate prediction, is now considerably better than could be obtained by chance or by the mean and a substantial improvement from Model 2, which explained just 24% of variance. It is evident that so far, by adding explanatory variables, we have improved our model considerably. Therefore, we continue to adjust our model iteratively to investigate whether other features of Copenhagen AirBnB listings are statistically significant predictors of the price of our stay.
Model 3 Diagnostics
autoplot(model3, alpha = 0.3, label.size = 3) +
theme_bw()
We find that the residuals are patterned, since they are clustered at the lower end of the fitted values, indicating that there is a pattern in our data which Model 3 still fails to account for. Further, we find that the residuals do not generally follow a quasi-normal distribution, which indicates our linear regression normality assumption may be untenable. In terms of scale-location, the equal (or constant) variance assumption appears to hold, since the positive trend across the fitted values is not observed when we take out instance 17340, as in the residuals vs fitted plot. Finally, we again identify quite a few observations with high absolute residuals, indicating that these observations may have undue influence on estimates of model parameters.
kbl(vif(model3), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1.28 | 4 | 1.03 |
| number_of_reviews | 1.02 | 1 | 1.01 |
| review_scores_rating | 1.01 | 1 | 1.00 |
| room_type | 1.26 | 3 | 1.04 |
| bathrooms | 1.26 | 1 | 1.12 |
| bedrooms | 1.74 | 1 | 1.32 |
| beds | 2.41 | 1 | 1.55 |
| accommodates | 2.97 | 1 | 1.72 |
As in Models 1 and 2, all VIF shown are less than 5, and thus we can safely conclude that this model does not have collinearity problems despite the VIF on beds and accommodates being unusually high (for obvious reasons) relative to the other explanatory variables. Thus, no variables require removal.
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model3, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model3, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(model3)$r.squared
r2_test <-
cor(predict(model3, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-Squared"))),
) %>%
kable_styling()
| RMSE | R-Squared | |
|---|---|---|
| train | 39777 | 0.398 |
| test | 38709 | 0.231 |
On comparison of the RMSE and adjusted R-squared of our third model, we find that, again, the error associated with our training data is greater than that of the model on test data. Much like for Model 2, we do however also find that the R2 for the model on training data (39.8%) is considerably higher than that of the R2 on test data (23.1%), which suggests that our model still suffers from overfitting.
We now evaluate whether Superhosts command a pricing premium, as we continue to control for all previous explanatory variables
#see Model1 for detailed annotation of regression model creation and diagnostic stages
model4 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
beds +
accommodates +
host_is_superhost, data = listings_necessary)
model4 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 1221.273 | 1.04 | 167.920 | 0.000 | 1124.040 | 1326.917 |
| prop_type_simplifiedCondominium | 1.024 | 1.01 | 2.129 | 0.033 | 1.002 | 1.047 |
| prop_type_simplifiedHouse | 0.922 | 1.01 | -5.578 | 0.000 | 0.895 | 0.948 |
| prop_type_simplifiedOthers | 1.358 | 1.01 | 21.005 | 0.000 | 1.320 | 1.397 |
| prop_type_simplifiedTownhouse | 1.015 | 1.02 | 0.723 | 0.470 | 0.974 | 1.058 |
| number_of_reviews | 1.001 | 1.00 | 6.489 | 0.000 | 1.000 | 1.001 |
| review_scores_rating | 1.004 | 1.00 | 9.101 | 0.000 | 1.003 | 1.005 |
| room_typeHotel room | 1.603 | 1.07 | 7.005 | 0.000 | 1.405 | 1.829 |
| room_typePrivate room | 0.664 | 1.01 | -52.433 | 0.000 | 0.654 | 0.674 |
| room_typeShared room | 0.528 | 1.05 | -13.463 | 0.000 | 0.481 | 0.580 |
| bathrooms | 1.165 | 1.01 | 13.905 | 0.000 | 1.140 | 1.190 |
| bedrooms | 1.046 | 1.00 | 13.805 | 0.000 | 1.039 | 1.053 |
| beds | 1.009 | 1.00 | 2.872 | 0.004 | 1.003 | 1.014 |
| accommodates | 1.116 | 1.00 | 38.387 | 0.000 | 1.110 | 1.122 |
| host_is_superhostTRUE | 1.077 | 1.01 | 8.374 | 0.000 | 1.058 | 1.096 |
model4 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.4 | 0.4 | 0.405 | 1121 | 0 | 14 |
Superhost
Our fourth regression model demonstrates clearly that whether the Airbnb host is a superhost is a statistically significant predictor of the price of our stay at all conventional significance levels. Specifically, we find that ceteris paribus, on average, if the host is a superhost (versus not being a superhost), the price of our stay is 7.7% higher.
Model 4 Performance
Though the newly included superhost predictor is statistically significant, the performance of this model has only improved marginally from an adjusted R-squared of 39.8% to approximately 40.0%, and thus this explanatory variable has not substantively improved the proportion of variance in price_4_nights explained by our model. Therefore, we continue to explore further predictors which may improve the predictive power of our model.
Model 4 Diagnostics
autoplot(model4, alpha = 0.3, label.size = 3) +
theme_bw()
We find that the residuals are patterned, since they cluster in the lower range of fitted values, indicating that there is a pattern in our data which is still not accounted for by the model itself. Again, the residuals follow approximately a normal distribution. In terms of scale-location, the equal (or constant) variance assumption seems to hold, since the is positive trend across the fitted values is predominantly associated to one outlier. Finally, as in all previous models (Models 1-3), we identify quite a few observations with high standardized residuals, indicating that these observations may have undue influence on estimates of model parameters.
kbl(vif(model4), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1.28 | 4 | 1.03 |
| number_of_reviews | 1.13 | 1 | 1.06 |
| review_scores_rating | 1.03 | 1 | 1.01 |
| room_type | 1.27 | 3 | 1.04 |
| bathrooms | 1.26 | 1 | 1.12 |
| bedrooms | 1.74 | 1 | 1.32 |
| beds | 2.41 | 1 | 1.55 |
| accommodates | 2.97 | 1 | 1.72 |
| host_is_superhost | 1.13 | 1 | 1.06 |
As in Models 1, 2 and 3, all VIF shown are less than 5, and thus we can safely conclude that this model does not have collinearity problems despite the VIF on beds and accommodates being unusually high, as we observed in Model 3 diagnostics. Thus, no variables require removal at this stage.
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model4, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model4, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(model4)$r.squared
r2_test <-
cor(predict(model4, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-Squared"))),
) %>%
kable_styling()
| RMSE | R-Squared | |
|---|---|---|
| train | 39651 | 0.400 |
| test | 38589 | 0.231 |
On comparison of the RMSE and R2 of our fourth model, we find that, again, the error associated with our training data is greater than that of the model on test data. Much like for Models 2 and 3, we do however also find that the R2 for the model on training data (40.0%) is considerably higher than that of the R2 on test data (23.1%), which suggests that our model still suffers from overfitting.
Subsequently, we investigate whether the listing’s location (exact or inexact) is a significant predictor of the price of the stay
Logically, we may expect that if an exact location is provided, guests may be willing to pay a higher price for their stay simply for the reduced uncertainty (or security) this information provides them ahead of their visit.
#see Model1 for detailed annotation of regression model creation and diagnostic stages
model5 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact, data = listings_necessary)
# exact location is not a significant predictor, so we exclude it in our next step regression
model5 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 1227.822 | 1.04 | 167.235 | 0.000 | 1129.612 | 1334.569 |
| prop_type_simplifiedCondominium | 1.024 | 1.01 | 2.099 | 0.036 | 1.002 | 1.046 |
| prop_type_simplifiedHouse | 0.922 | 1.01 | -5.582 | 0.000 | 0.895 | 0.948 |
| prop_type_simplifiedOthers | 1.355 | 1.01 | 20.711 | 0.000 | 1.316 | 1.394 |
| prop_type_simplifiedTownhouse | 1.015 | 1.02 | 0.682 | 0.495 | 0.973 | 1.058 |
| number_of_reviews | 1.001 | 1.00 | 6.512 | 0.000 | 1.000 | 1.001 |
| review_scores_rating | 1.004 | 1.00 | 9.123 | 0.000 | 1.003 | 1.005 |
| room_typeHotel room | 1.607 | 1.07 | 7.042 | 0.000 | 1.408 | 1.834 |
| room_typePrivate room | 0.664 | 1.01 | -52.438 | 0.000 | 0.654 | 0.674 |
| room_typeShared room | 0.528 | 1.05 | -13.455 | 0.000 | 0.482 | 0.580 |
| bathrooms | 1.165 | 1.01 | 13.909 | 0.000 | 1.140 | 1.190 |
| bedrooms | 1.046 | 1.00 | 13.800 | 0.000 | 1.039 | 1.053 |
| beds | 1.009 | 1.00 | 2.900 | 0.004 | 1.003 | 1.014 |
| accommodates | 1.116 | 1.00 | 38.400 | 0.000 | 1.110 | 1.122 |
| host_is_superhostTRUE | 1.078 | 1.01 | 8.417 | 0.000 | 1.059 | 1.096 |
| is_location_exactTRUE | 0.992 | 1.01 | -1.279 | 0.201 | 0.979 | 1.004 |
model5 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.4 | 0.4 | 0.405 | 1047 | 0 | 15 |
exact location
We can see clearly that the p-value (0.20) for exact location is not less than 0.05, our alpha level, and thus this is not a statistically significant predictor of the price of our stay. Counter to our expectations, this may be the case since Copenhagen has strict property rental, purchase and letting legislation, which is regularly and heavily enforced. As such, the proability of a “false” listing, and thus of any risk or uncertainty associated with location, is low - and thus hosts may see no price premium in providing the exact location relative to cities with less regulation on property services like Airbnb.
Model 5 Performance
Re-evaluating our model’s performance, we find that our adjusted R squared is effectively unchanged (at 40.5%, from 40.0% in Model 4), which is to be expected since exact location is not a statistically significant predictor of our outcome variable, price_4_nights.
Having included explanatory variables for key characteristics of Airbnb listings which we would expect to be closely related to the price of our Airbnb stay, we now turn to evaluating the relationship between price and location more precisely.
Since exact location is not a statistically significant predictor of the price of our stay and does not substantially increase the adjusted R squared of our model, we do not include it in subsequent regressions.
Model 5 Diagnostics
autoplot(model5, alpha = 0.3, label.size = 3) +
theme_bw()
Again, we find that the residuals are patterned (though seemingly less so than in Model 4), since they cluster in the lower range of fitted values but not as extremely. This indicates that there are still underlying patterns in our data which are not accounted for by the model itself. Further, we observe the residuals follow approximately a normal distribution. In terms of scale-location, the equal (or constant) variance assumption also seems to hold, just like in Models 3 and 4, since there is a no clear positive trend across the fitted values if we take out the outliers. Finally, as in all previous models (Models 1-4), we identify quite a few observations with high absolute residuals, indicating that these observations may have undue influence on estimates of model parameters.
kbl(vif(model5), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1.30 | 4 | 1.03 |
| number_of_reviews | 1.13 | 1 | 1.06 |
| review_scores_rating | 1.03 | 1 | 1.01 |
| room_type | 1.27 | 3 | 1.04 |
| bathrooms | 1.26 | 1 | 1.12 |
| bedrooms | 1.74 | 1 | 1.32 |
| beds | 2.41 | 1 | 1.55 |
| accommodates | 2.97 | 1 | 1.72 |
| host_is_superhost | 1.13 | 1 | 1.06 |
| is_location_exact | 1.02 | 1 | 1.01 |
As before, all VIF shown are less than 5, and thus we can safely conclude that this model does not have collinearity problems. We make no removals due to collinearity, but rather only due to a lack of significance of the is_location_exact predictor.
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model5, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model5, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(model5)$r.squared
r2_test <-
cor(predict(model5, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-Squared"))),
) %>%
kable_styling()
| RMSE | R-Squared | |
|---|---|---|
| train | 39643 | 0.400 |
| test | 38611 | 0.231 |
Whereas Models 1 through 4 demonstrated that the model had a higher RMSE on training data than test data, we now find what would be expected in a traditional case of overfitting: that the model has a lower RMSE on training data and a considerably higher R2, exactly as in Model 4 (40.0% versus 23.1%). Again, and unconventionally, we find the error associated with our training data is greater than that of the model on test data. However, the difference between the RMSE is, as in Model 4, relatively small (the RMSE on the training data is less than 3% higher than that on the test data).
Though our map showing the relative price of stays at Airbnb’s provided a good impression of the geographical distribution of Copenhagen Airbnb stays by price and property type, we now delve deeper into how the price of Airbnb stays varies across distinct neighbourhoods within the city.
First, we specify 5 levels in a variable called “neighbourhood_simplified”, which groups together nearby and demographically neighbourhoods
As an example, we have grouped together Fredriksberg and Vanlose not only because they are adjacent, but because they are both large city districts predominantly characterised by residential property and large parkland, away from the tourist-dominated city centre.
Broadly, we divide city neighborhoods into the categories: North, North West, West, South, and East. Though some of these categories are immediately (geographically) adjacent to one another, where possible we have ensured that the constituents share similar socio-demographic characteristics, and thus there is an undeniably distinct character to each of the 5 neighbourhoods we have specified.
# we group all neighbourhoods into North, East, West, South, Northwest using case_when, for the whole dataset
listings_necessary <- listings_necessary %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_cleansed %in% c('Frederiksberg', 'Vanlse') ~ 'East',
neighbourhood_cleansed %in% c('Vesterbro-Kongens Enghave', 'Indre By ') ~ 'West',
neighbourhood_cleansed %in% c('Amager', 'Amager Vest') ~ 'South',
neighbourhood_cleansed %in% c('Bispebjerg', 'Brnshj-Husum') ~ 'Northwest',
TRUE ~ 'North'
))
# we group all neighbourhoods into North, East, West, South, Northwest using case_when, for the training dataset
listings_train <- listings_train %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_cleansed %in% c('Frederiksberg', 'Vanlse') ~ 'East',
neighbourhood_cleansed %in% c('Vesterbro-Kongens Enghave', 'Indre By ') ~ 'West',
neighbourhood_cleansed %in% c('Amager', 'Amager Vest') ~ 'South',
neighbourhood_cleansed %in% c('Bispebjerg', 'Brnshj-Husum') ~ 'Northwest',
TRUE ~ 'North'
))
# we group all neighbourhoods into North, East, West, South, Northwest using case_when, for the testing dataset
listings_test <- listings_test %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_cleansed %in% c('Frederiksberg', 'Vanlse') ~ 'East',
neighbourhood_cleansed %in% c('Vesterbro-Kongens Enghave', 'Indre By ') ~ 'West',
neighbourhood_cleansed %in% c('Amager', 'Amager Vest') ~ 'South',
neighbourhood_cleansed %in% c('Bispebjerg', 'Brnshj-Husum') ~ 'Northwest',
TRUE ~ 'North'
))
Next, we use this new simplified neighbourhood variable in a new regression to determine whether it is a statistically significant predictor of the price of our stay and can improve the explained variation of our model
#see Model1 for detailed annotation of regression model creation and diagnostic stages
#we run a regression with all model4 variables, as well as the new neighbourhood_simplified explanatory variable
model6 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
beds +
accommodates +
host_is_superhost +
neighbourhood_simplified,
data = listings_necessary)
#we find that grouped neighbourhood is a significant predictor
model6 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 1194.863 | 1.04 | 168.687 | 0.000 | 1100.426 | 1297.404 |
| prop_type_simplifiedCondominium | 1.028 | 1.01 | 2.502 | 0.012 | 1.006 | 1.050 |
| prop_type_simplifiedHouse | 0.970 | 1.01 | -2.091 | 0.037 | 0.943 | 0.998 |
| prop_type_simplifiedOthers | 1.362 | 1.01 | 21.662 | 0.000 | 1.325 | 1.401 |
| prop_type_simplifiedTownhouse | 1.016 | 1.02 | 0.773 | 0.440 | 0.976 | 1.059 |
| number_of_reviews | 1.000 | 1.00 | 5.005 | 0.000 | 1.000 | 1.001 |
| review_scores_rating | 1.004 | 1.00 | 8.826 | 0.000 | 1.003 | 1.005 |
| room_typeHotel room | 1.587 | 1.07 | 6.996 | 0.000 | 1.394 | 1.806 |
| room_typePrivate room | 0.670 | 1.01 | -52.315 | 0.000 | 0.660 | 0.680 |
| room_typeShared room | 0.546 | 1.05 | -13.022 | 0.000 | 0.498 | 0.598 |
| bathrooms | 1.163 | 1.01 | 14.051 | 0.000 | 1.139 | 1.188 |
| bedrooms | 1.044 | 1.00 | 13.645 | 0.000 | 1.038 | 1.051 |
| beds | 1.009 | 1.00 | 2.963 | 0.003 | 1.003 | 1.014 |
| accommodates | 1.117 | 1.00 | 39.519 | 0.000 | 1.111 | 1.123 |
| host_is_superhostTRUE | 1.073 | 1.01 | 8.106 | 0.000 | 1.055 | 1.091 |
| neighbourhood_simplifiedNorth | 1.071 | 1.01 | 8.939 | 0.000 | 1.055 | 1.088 |
| neighbourhood_simplifiedNorthwest | 0.781 | 1.01 | -20.380 | 0.000 | 0.763 | 0.800 |
| neighbourhood_simplifiedSouth | 1.046 | 1.01 | 4.069 | 0.000 | 1.024 | 1.069 |
| neighbourhood_simplifiedWest | 1.097 | 1.01 | 9.710 | 0.000 | 1.076 | 1.117 |
model6 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.424 | 0.424 | 0.397 | 963 | 0 | 18 |
neighbourhood_simplified
This regression demonstrates how, across all levels and at all conventional significance levels, neighbourhood_simplified is a statistically significant predictor of the price of our (4 night) stay for 2 guests in Copenhaghen. Notably, relative to Airbnb listings in the East (Fredriksberg and Vanlose), those in the North West are associated with lower priced stays, while those in the South, North and West are associated with higher priced stays, each to an increasing degree. While, ceteris paribus, the average Airbnb listed in the North West are on average 22% cheaper than those listed in the East, those listed in the South, North and West are on average 5%, 7% and 10% more expensive, respectively.
This is to be expected, since the North West region we specified consists of Bispebjerg and Bronshoj-Husum, both of which are extensive suburbs consisting largely of single family detached homes. Each require a 20 minute drive or 30 minute train to get into the city, which for most visitors is enough of a reason either to seek Airbnbs more centrally in the city, or pay considerably less for their stay in these areas. The West is also an obvious and expected outlier. Since this contains the majority of the most central parts of the city via the neighbourhood Indre By, spanning Nyhavn harbour, Tivoli gardens and most of the cities sights, the average price is notably (and justifiably) higher with no need to commute and bars and restaurants on the ground floor of most properties.
Model 6 Performance
Despite the addition of statistically significant geographical predictors, we find that the adjusted R-squared of our model has not increased greatly since our Model 4 iteration, which had an adjusted R-squared of 40.0%. At just 42.9%, our model still explains less than half of the variation in the price_4_nights variable (the price of our Copenhagen stay). In an effort to do better, we consider whether listings which provide generous cancellation policies (and therefore reduce uncertainty surrounding guests’ visit to the city) are associated with higher priced stays.
Model 6 Diagnostics
#see Model1 for detailed annotation of regression model creation and diagnostic stages
autoplot(model6, alpha = 0.3, label.size = 3) +
theme_bw()
Again, we find that the residuals are less patterned, compared with previous models, they cluster in the lower range of fitted values and the trend line has a negative gradient deviating from Y=0. This indicates that despite our best efforts, there are remaining underlying patterns in our data which are not accounted for by the model itself, ratifying our underwhelming 40.0% R-squared. Further, the residuals roughly follow a normal distribution.
In terms of scale-location, the equal (or constant) variance assumption also seems to hold, just like in Models 3-5. Finally, though in all previous models we identified quite a few observations with high absolute residuals, there seems to be heavier clustering towards 0, indicating that while some observations may have undue influence on estimates of model parameters, this may be less problematic than in earlier iterations.
#see Model1 for detailed annotation of regression model creation and diagnostic stages
kbl(vif(model6), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1.31 | 4 | 1.03 |
| number_of_reviews | 1.13 | 1 | 1.06 |
| review_scores_rating | 1.03 | 1 | 1.01 |
| room_type | 1.27 | 3 | 1.04 |
| bathrooms | 1.26 | 1 | 1.12 |
| bedrooms | 1.74 | 1 | 1.32 |
| beds | 2.41 | 1 | 1.55 |
| accommodates | 2.97 | 1 | 1.72 |
| host_is_superhost | 1.13 | 1 | 1.06 |
| neighbourhood_simplified | 1.04 | 4 | 1.01 |
As before, all VIF shown are less than 5, and thus this model does not appear to have collinearity problems.
#see Model1 for detailed annotation of regression model creation and diagnostic stages
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model6, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model6, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(model6)$r.squared
r2_test <-
cor(predict(model6, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-Squared"))),
) %>%
kable_styling()
| RMSE | R-Squared | |
|---|---|---|
| train | 38809 | 0.424 |
| test | 37847 | 0.237 |
This comparison demonstrates a similar trend to Models 1-4, where the training data had a higher RMSE than on the test data, however relative to both Models 4 and 5, Model 6 now shows a larger difference between the RMSEs on the two datasets, indicating that overfitting (albeit of a potentially unconventional form, as discussed in Model 1 Diagnostics) is potentially worse in this model. Focussing on the R2, we find that there is a small increase in the difference between test and training data fit relative to Model 5, as the training R2 is now 43% rather than 40%. Consequently, overfitting appears to be a persistent problem across all of our models.
Now, we think about the relationship between cancellation policies and the price of our stay
#see Model1 for detailed annotation of regression model creation and diagnostic stages
model7 <-
lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bathrooms +
bedrooms +
beds +
accommodates +
host_is_superhost +
neighbourhood_simplified +
cancellation_policy, data = listings_necessary)
# cancellation policy is a significant predictor
model7 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 1144.188 | 1.04 | 168.802 | 0.000 | 1054.346 | 1241.686 |
| prop_type_simplifiedCondominium | 1.023 | 1.01 | 2.071 | 0.038 | 1.001 | 1.045 |
| prop_type_simplifiedHouse | 0.971 | 1.01 | -2.071 | 0.038 | 0.944 | 0.998 |
| prop_type_simplifiedOthers | 1.250 | 1.02 | 12.827 | 0.000 | 1.208 | 1.293 |
| prop_type_simplifiedTownhouse | 1.010 | 1.02 | 0.468 | 0.640 | 0.970 | 1.051 |
| number_of_reviews | 1.000 | 1.00 | 3.000 | 0.003 | 1.000 | 1.000 |
| review_scores_rating | 1.004 | 1.00 | 9.164 | 0.000 | 1.003 | 1.005 |
| room_typeHotel room | 1.795 | 1.07 | 8.823 | 0.000 | 1.576 | 2.044 |
| room_typePrivate room | 0.679 | 1.01 | -50.595 | 0.000 | 0.669 | 0.690 |
| room_typeShared room | 0.554 | 1.05 | -12.815 | 0.000 | 0.506 | 0.606 |
| bathrooms | 1.161 | 1.01 | 13.974 | 0.000 | 1.137 | 1.186 |
| bedrooms | 1.043 | 1.00 | 13.427 | 0.000 | 1.037 | 1.050 |
| beds | 1.011 | 1.00 | 3.739 | 0.000 | 1.005 | 1.016 |
| accommodates | 1.114 | 1.00 | 38.730 | 0.000 | 1.108 | 1.120 |
| host_is_superhostTRUE | 1.067 | 1.01 | 7.544 | 0.000 | 1.049 | 1.085 |
| neighbourhood_simplifiedNorth | 1.067 | 1.01 | 8.485 | 0.000 | 1.051 | 1.083 |
| neighbourhood_simplifiedNorthwest | 0.784 | 1.01 | -20.235 | 0.000 | 0.766 | 0.803 |
| neighbourhood_simplifiedSouth | 1.043 | 1.01 | 3.852 | 0.000 | 1.021 | 1.066 |
| neighbourhood_simplifiedWest | 1.092 | 1.01 | 9.316 | 0.000 | 1.072 | 1.112 |
| cancellation_policymoderate | 1.054 | 1.01 | 8.639 | 0.000 | 1.041 | 1.067 |
| cancellation_policystrict_14_with_grace_period | 1.130 | 1.01 | 18.206 | 0.000 | 1.116 | 1.146 |
| cancellation_policysuper_strict_60 | 1.343 | 1.03 | 10.055 | 0.000 | 1.268 | 1.422 |
model7 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.434 | 0.434 | 0.393 | 859 | 0 | 21 |
Cancellation Policy
The regression above shows an interesting and unexpected relationship. Though we may hypothesise that listings with more flexible cancellation policies may be more expensive, since guests may be willing to pay a premium for the ability to cancel whenever they like, this is not the case.
The regression output demonstrates that cancellation policy is a statistically significant predictor of the price of our stay across all levels and at all conventional significance levels. Relative to a flexible cancellation policy, the average price of our Airbnb stay, ceteris paribus, increases as the cancellation policy becomes stricter. This is demonstrated by the fact that moderate cancellation policy is associated on average with a 5.6% higher priced stay than an Airbnb with a flexible policy, increasing to 13.4% and 31.1% for strict and super strict cancellation policies, respectively. Thus, it is evident that more expensive stays are associated with stricter cancellation policies, which makes sense since these properties are probably harder to find guests for and thus need to secure their bookings with certainty, versus cheaper properties in very high demand which can be re-booked within hours of a cancellation.
Model 7 Performance
Again, the addition of the statistically significant cancellation_policy explanatory variable has only improved our adjusted R squared by 1%, from 42.9% to 43.9%. Despite the fact that we have gained a considerably better understanding of the main drivers of the price of our Copenhagen Airbnb stay, we are still explaining less than half of the variation in the price of listings for a 4 night stay with 2 guests.
This suggests that a more systematic approach is required to build a model which explains as much variation in the price of our stay as possible, which is what we now turn to exploring.
Model 7 Diagnostics
#see Model1 for detailed annotation of regression model creation and diagnostic stages
autoplot(model7, alpha = 0.3, label.size = 3) +
theme_bw()

In Model 7, our penultimate model, we find a very similar outcome to our previous diagnostics plots, in that the residuals are less patterned, indicating our most detailed model has captured more key underlying patterns in our data than previous models. Further, the majority of residuals follow the normal distribution as shown in the Normal Q-Q plot
Again, in terms of scale-location, the equal (or constant) variance assumption also seems to hold, just like previous model. Finally, a number of observations with high absolute residuals, indicate that some observations do have undue influence on estimates of model parameters.
Though these diagnostics suggest that the core assumptions of linear regression may not hold perfectly, these deviations are not substantial in most cases. However, we revise our variable selection in an attempt to ensure that as many of these linear regression assumptions as possible do hold, ensuring the validity of outputted coefficients.
#see Model1 for detailed annotation of regression model creation and diagnostic stages
kbl(vif(model7), align="l")%>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 2.00 | 4 | 1.09 |
| number_of_reviews | 1.16 | 1 | 1.08 |
| review_scores_rating | 1.03 | 1 | 1.01 |
| room_type | 1.32 | 3 | 1.05 |
| bathrooms | 1.26 | 1 | 1.12 |
| bedrooms | 1.74 | 1 | 1.32 |
| beds | 2.41 | 1 | 1.55 |
| accommodates | 2.99 | 1 | 1.73 |
| host_is_superhost | 1.14 | 1 | 1.07 |
| neighbourhood_simplified | 1.05 | 4 | 1.01 |
| cancellation_policy | 1.62 | 3 | 1.08 |
As in all previous models, we ratify earlier findings that there is no concern over collinearity between our variables, since all explanatory variables have VIFs less than 5.
#see Model1 for detailed annotation of regression model creation and diagnostic stages
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(model7, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(model7, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(model7)$r.squared
r2_test <-
cor(predict(model7, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-squared"))),
) %>%
kable_styling()
| RMSE | R-squared | |
|---|---|---|
| train | 38328 | 0.434 |
| test | 37760 | 0.240 |
In our penultimate model, we find the greatest difference in terms of R2, where the model explains 44% of variance in our outcome variable (price_4_nights) on training data, and just 23% on our test data. As in Models 1-6, we find that, again, the error associated with our training data is greater than that of the model on test data. Yet, as in Models 3, 4 and 5, this difference is relatively small, implying that overfitting may not be as severe as R2 alone suggests. However, this warrants further investigation into the nature and potential implications of overfitting in this context, which is beyond the scope of the present analyses.
In order to summarise our 7 models so far and their performance, we publish a summary table of the predictors we include in each model, their statistical signficance and logged coefficients, their adjusted R-squared and residual standard error.
#we generate a neatly named summary table showing a comparison of all 7 models
huxreg(model1,model2,model3,model4,model5,model6,model7,
coefs=c("Property Type | Condominium"="prop_type_simplifiedCondominium",
"Property Type | House"="prop_type_simplifiedHouse","Property Type | Others"="prop_type_simplifiedOthers","Property Type | Townhouse"="prop_type_simplifiedTownhouse",
"Number of Reviews"="number_of_reviews",
"Review Scores Rating"="review_scores_rating",
"Room Type | Hotel Room"="room_typeHotel room",
"Room Type | Private Room"="room_typePrivate room",
"Room Type | Shared Room"="room_typeShared room", "Bathrooms"="bathrooms"))
| (1) | (2) | (3) | (4) | (5) | (6) | (7) | |
|---|---|---|---|---|---|---|---|
| Property Type | Condominium | 0.074 *** | 0.035 ** | 0.025 * | 0.024 * | 0.023 * | 0.027 * | 0.022 * |
| (0.016) | (0.013) | (0.011) | (0.011) | (0.011) | (0.011) | (0.011) | |
| Property Type | House | 0.251 *** | 0.275 *** | -0.082 *** | -0.082 *** | -0.082 *** | -0.030 * | -0.030 * |
| (0.020) | (0.015) | (0.015) | (0.015) | (0.015) | (0.014) | (0.014) | |
| Property Type | Others | 0.468 *** | 0.449 *** | 0.307 *** | 0.306 *** | 0.304 *** | 0.309 *** | 0.223 *** |
| (0.021) | (0.016) | (0.015) | (0.015) | (0.015) | (0.014) | (0.017) | |
| Property Type | Townhouse | 0.349 *** | 0.358 *** | 0.018 | 0.015 | 0.014 | 0.016 | 0.010 |
| (0.030) | (0.023) | (0.021) | (0.021) | (0.021) | (0.021) | (0.021) | |
| Number of Reviews | 0.000 | 0.001 *** | 0.001 *** | 0.001 *** | 0.001 *** | 0.000 *** | 0.000 ** |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| Review Scores Rating | 0.006 *** | 0.004 *** | 0.004 *** | 0.004 *** | 0.004 *** | 0.004 *** | 0.004 *** |
| (0.001) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| Room Type | Hotel Room | 0.269 *** | 0.467 *** | 0.472 *** | 0.475 *** | 0.462 *** | 0.585 *** | |
| (0.076) | (0.067) | (0.067) | (0.067) | (0.066) | (0.066) | ||
| Room Type | Private Room | -0.617 *** | -0.406 *** | -0.410 *** | -0.410 *** | -0.401 *** | -0.387 *** | |
| (0.008) | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) | ||
| Room Type | Shared Room | -0.892 *** | -0.640 *** | -0.638 *** | -0.638 *** | -0.605 *** | -0.591 *** | |
| (0.053) | (0.047) | (0.047) | (0.047) | (0.046) | (0.046) | ||
| Bathrooms | 0.153 *** | 0.153 *** | 0.153 *** | 0.151 *** | 0.149 *** | ||
| (0.011) | (0.011) | (0.011) | (0.011) | (0.011) | |||
| N | 17655 | 23539 | 23539 | 23539 | 23539 | 23539 | 23539 |
| R2 | 0.047 | 0.242 | 0.398 | 0.400 | 0.400 | 0.424 | 0.434 |
| logLik | -13053.633 | -14877.111 | -12150.268 | -12115.234 | -12114.415 | -11634.332 | -11431.993 |
| AIC | 26123.267 | 29776.222 | 24330.536 | 24262.468 | 24262.830 | 23308.663 | 22909.985 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||||
Based on model7, we want to add more variables to increase predictive power, specifying a model which has the highest possible Adjusted R2, and therefore which explains as much of the variance in the outcome variable, price of a 4-night stay for 2 at an AirBnB in Copenhagen, as possible.
In the amenities variable, we believe WIFI, TV and Internet should matter most for people in today’s highly digitized society. Further, since most of the properties are Apartments and Houses, instead of Hotels, travelers might also expect to have a kitchen (and be willing to pay for it). Therefore we built 4 logical variables indicating whether the property has these amenities:
listings_necessary <- listings_necessary %>%
mutate(has_wifi = grepl('Wifi', amenities),
has_TV = grepl('TV', amenities),
has_internet = grepl('Internet', amenities),
has_kitchen = grepl('Kitchen', amenities))
listings_train <- listings_train %>%
mutate(has_wifi = grepl('Wifi', amenities),
has_TV = grepl('TV', amenities),
has_internet = grepl('Internet', amenities),
has_kitchen = grepl('Kitchen', amenities))
listings_test <- listings_test %>%
mutate(has_wifi = grepl('Wifi', amenities),
has_TV = grepl('TV', amenities),
has_internet = grepl('Internet', amenities),
has_kitchen = grepl('Kitchen', amenities))
Also, the absolute values of bathroom, bedroom and beds are highly correlated with rooms/beds per capacity (as shown in the Listing-Related coefficient matrix from our EDA), and thus in order to avoid problematic multicollinearity, in order to achieve a higher R2 and ensure model simplicity and interpretibility, we only included the absolute values. Further, our previous maps demonstrated that properties are clustered in neighbourhood by their price, and thus we include our simplified neighbourhood variable, too.
Inspired by previous models, we also added reviews of other matters of the property except review_scores_checkin and review_scores_accuracy, since we do not consider them significant and they were not statistically significant in our preliminary tests. Finally, we added more variables to account for the flexibility of the AirBnB booking process, such as instant_bookable.
Further, since we used price, extra_people, guests_included and cleaning_fee to calculate price_4_nights, it makes no sense to include them in our model. this is obvious: once we know these variables, we can calculate price_4_nights straightaway, which does not allow us to understand the underlying drivers of listing prices in the city.
The variables we used in our optimal model can be categorised as follows:
#we therefore build our optimised linear regression model as follows, incorporating variables representing each of these categories:
best_model <- lm(log(price_4_nights) ~
# Facility and Location
prop_type_simplified +
room_type +
bathrooms +
bedrooms +
beds +
accommodates +
neighbourhood_simplified +
# Amenities
has_wifi +
has_TV +
has_internet +
has_kitchen +
# Super-Host
host_is_superhost +
# Property Reviews
review_scores_rating +
review_scores_communication +
review_scores_location +
review_scores_value +
review_scores_cleanliness +
# Flexibility of Property Booking and Terms
instant_bookable +
security_deposit +
cancellation_policy +
minimum_nights,
data = listings_necessary)
We now summarise the output of our model as before:
# Format and display model summary
best_model %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate), std.error = exp(std.error),conf.low=exp(conf.low),conf.high=exp(conf.high)) %>%
kbl(col.names=c("Term","Estimate","Standard Error","t-statistic","p-value","Lower CI","Upper CI")) %>%
kable_styling()
| Term | Estimate | Standard Error | t-statistic | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 873.446 | 1.06 | 111.868 | 0.000 | 775.715 | 983.490 |
| prop_type_simplifiedCondominium | 1.019 | 1.01 | 1.776 | 0.076 | 0.998 | 1.040 |
| prop_type_simplifiedHouse | 0.974 | 1.01 | -1.878 | 0.060 | 0.947 | 1.001 |
| prop_type_simplifiedOthers | 1.226 | 1.02 | 11.974 | 0.000 | 1.186 | 1.267 |
| prop_type_simplifiedTownhouse | 1.010 | 1.02 | 0.471 | 0.638 | 0.970 | 1.050 |
| room_typeHotel room | 1.580 | 1.07 | 7.030 | 0.000 | 1.391 | 1.795 |
| room_typePrivate room | 0.690 | 1.01 | -47.425 | 0.000 | 0.680 | 0.701 |
| room_typeShared room | 0.560 | 1.05 | -12.874 | 0.000 | 0.513 | 0.612 |
| bathrooms | 1.152 | 1.01 | 13.549 | 0.000 | 1.128 | 1.175 |
| bedrooms | 1.043 | 1.00 | 13.527 | 0.000 | 1.036 | 1.049 |
| beds | 1.011 | 1.00 | 3.811 | 0.000 | 1.005 | 1.016 |
| accommodates | 1.111 | 1.00 | 38.709 | 0.000 | 1.105 | 1.117 |
| neighbourhood_simplifiedNorth | 1.068 | 1.01 | 8.842 | 0.000 | 1.053 | 1.084 |
| neighbourhood_simplifiedNorthwest | 0.827 | 1.01 | -16.070 | 0.000 | 0.808 | 0.846 |
| neighbourhood_simplifiedSouth | 1.042 | 1.01 | 3.857 | 0.000 | 1.021 | 1.064 |
| neighbourhood_simplifiedWest | 1.091 | 1.01 | 9.502 | 0.000 | 1.072 | 1.111 |
| has_wifiTRUE | 1.049 | 1.01 | 3.495 | 0.000 | 1.021 | 1.077 |
| has_TVTRUE | 1.099 | 1.01 | 16.337 | 0.000 | 1.087 | 1.111 |
| has_internetTRUE | 0.978 | 1.01 | -3.824 | 0.000 | 0.967 | 0.989 |
| has_kitchenTRUE | 0.892 | 1.01 | -7.670 | 0.000 | 0.866 | 0.918 |
| host_is_superhostTRUE | 1.054 | 1.01 | 6.565 | 0.000 | 1.038 | 1.071 |
| review_scores_rating | 1.003 | 1.00 | 4.102 | 0.000 | 1.001 | 1.004 |
| review_scores_communication | 0.969 | 1.01 | -4.957 | 0.000 | 0.958 | 0.981 |
| review_scores_location | 1.125 | 1.00 | 25.791 | 0.000 | 1.115 | 1.135 |
| review_scores_value | 0.920 | 1.00 | -16.957 | 0.000 | 0.911 | 0.929 |
| review_scores_cleanliness | 1.042 | 1.00 | 10.640 | 0.000 | 1.034 | 1.050 |
| instant_bookableTRUE | 1.015 | 1.01 | 2.641 | 0.008 | 1.004 | 1.027 |
| security_deposit | 1.000 | 1.00 | 8.996 | 0.000 | 1.000 | 1.000 |
| cancellation_policymoderate | 1.058 | 1.01 | 9.553 | 0.000 | 1.046 | 1.071 |
| cancellation_policystrict_14_with_grace_period | 1.123 | 1.01 | 17.494 | 0.000 | 1.108 | 1.137 |
| cancellation_policysuper_strict_60 | 1.168 | 1.03 | 5.172 | 0.000 | 1.101 | 1.238 |
| minimum_nights | 0.986 | 1.00 | -5.321 | 0.000 | 0.981 | 0.991 |
# Format and display initial model performance analysis
best_model %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("Model R-Squared","Adjusted R-Squared","Sigma","t-statistic","p-value","Degrees of Freedom")) %>%
kable_styling()
| Model R-Squared | Adjusted R-Squared | Sigma | t-statistic | p-value | Degrees of Freedom |
|---|---|---|---|---|---|
| 0.465 | 0.464 | 0.383 | 658 | 0 | 31 |
We can see that except some factors in property type, all other variables are statistically significant. Our model reaches 46.9% adjusted R2, which is a 3% improvement from model7.
Though our model still only explains 46.9% of variance in the price of the Copenhagen stay, it is also 46.9% better than just taking the mean of all properties as prediction, which is a considerable improvement over many alternative, particularly indiscriminate or random, approaches.
Best Model Comparison
We now report a comparison of our first model, Model 1, alongside our final model before optimisation, Model 7, and then finally our optimised model, created according to the rationale and interrogative investigation throughout this report.
huxreg(model1, model7,best_model,
coefs=c("Property Type | Condominium"="prop_type_simplifiedCondominium",
"Property Type | House"="prop_type_simplifiedHouse",
"Property Type | Others"="prop_type_simplifiedOthers",
"Property Type | Townhouse"="prop_type_simplifiedTownhouse",
"Number of Reviews"="number_of_reviews",
"Review Scores Rating"="review_scores_rating",
"Room Type | Hotel Room"="room_typeHotel room",
"Room Type | Private Room"="room_typePrivate room",
"Room Type | Shared Room"="room_typeShared room",
"Bathrooms"="bathrooms",
"Bedrooms"="bedrooms",
"Beds"="beds",
"Accommodates"="accommodates",
"Host is Super Host | True"="host_is_superhostTRUE",
"Neighbourhood | North"="neighbourhood_simplifiedNorth",
"Neighbourhood | South"="neighbourhood_simplifiedSouth",
"Neighbourhood | North West"="neighbourhood_simplifiedNorthwest",
"Neighbourhood | West"="neighbourhood_simplifiedWest",
"Cancellation Policy | Moderate"="cancellation_policymoderate",
"Cancellation Policy | Strict"="cancellation_policystrict_14_with_grace_period",
"Cancellation Policy | Super Strict"="cancellation_policysuper_strict_60",
"Has Wifi | True"="has_wifiTRUE",
"Has TV | True"="has_TVTRUE",
"Has Internet | True"="has_internetTRUE",
"Has Kitchen | True"="has_kitchenTRUE",
"Review Scores Communication"="review_scores_communication",
"Review Scores Location"="review_scores_location",
"Review Scores Value"="review_scores_value",
"Review Scores Cleanliness"="review_scores_cleanliness",
"Instant Bookable | True"="instant_bookableTRUE",
"Security Deposit"="security_deposit",
"Minimum Nights"="minimum_nights"
))
| (1) | (2) | (3) | |
|---|---|---|---|
| Property Type | Condominium | 0.074 *** | 0.022 * | 0.019 |
| (0.016) | (0.011) | (0.011) | |
| Property Type | House | 0.251 *** | -0.030 * | -0.026 |
| (0.020) | (0.014) | (0.014) | |
| Property Type | Others | 0.468 *** | 0.223 *** | 0.204 *** |
| (0.021) | (0.017) | (0.017) | |
| Property Type | Townhouse | 0.349 *** | 0.010 | 0.009 |
| (0.030) | (0.021) | (0.020) | |
| Number of Reviews | 0.000 | 0.000 ** | |
| (0.000) | (0.000) | ||
| Review Scores Rating | 0.006 *** | 0.004 *** | 0.003 *** |
| (0.001) | (0.000) | (0.001) | |
| Room Type | Hotel Room | 0.585 *** | 0.457 *** | |
| (0.066) | (0.065) | ||
| Room Type | Private Room | -0.387 *** | -0.371 *** | |
| (0.008) | (0.008) | ||
| Room Type | Shared Room | -0.591 *** | -0.580 *** | |
| (0.046) | (0.045) | ||
| Bathrooms | 0.149 *** | 0.141 *** | |
| (0.011) | (0.010) | ||
| Bedrooms | 0.042 *** | 0.042 *** | |
| (0.003) | (0.003) | ||
| Beds | 0.011 *** | 0.011 *** | |
| (0.003) | (0.003) | ||
| Accommodates | 0.108 *** | 0.105 *** | |
| (0.003) | (0.003) | ||
| Host is Super Host | True | 0.065 *** | 0.053 *** | |
| (0.009) | (0.008) | ||
| Neighbourhood | North | 0.065 *** | 0.066 *** | |
| (0.008) | (0.007) | ||
| Neighbourhood | South | 0.042 *** | 0.041 *** | |
| (0.011) | (0.011) | ||
| Neighbourhood | North West | -0.243 *** | -0.190 *** | |
| (0.012) | (0.012) | ||
| Neighbourhood | West | 0.088 *** | 0.087 *** | |
| (0.009) | (0.009) | ||
| Cancellation Policy | Moderate | 0.053 *** | 0.057 *** | |
| (0.006) | (0.006) | ||
| Cancellation Policy | Strict | 0.123 *** | 0.116 *** | |
| (0.007) | (0.007) | ||
| Cancellation Policy | Super Strict | 0.295 *** | 0.155 *** | |
| (0.029) | (0.030) | ||
| Has Wifi | True | 0.048 *** | ||
| (0.014) | |||
| Has TV | True | 0.094 *** | ||
| (0.006) | |||
| Has Internet | True | -0.022 *** | ||
| (0.006) | |||
| Has Kitchen | True | -0.114 *** | ||
| (0.015) | |||
| Review Scores Communication | -0.031 *** | ||
| (0.006) | |||
| Review Scores Location | 0.118 *** | ||
| (0.005) | |||
| Review Scores Value | -0.083 *** | ||
| (0.005) | |||
| Review Scores Cleanliness | 0.041 *** | ||
| (0.004) | |||
| Instant Bookable | True | 0.015 ** | ||
| (0.006) | |||
| Security Deposit | 0.000 *** | ||
| (0.000) | |||
| Minimum Nights | -0.014 *** | ||
| (0.003) | |||
| N | 17655 | 23539 | 23539 |
| R2 | 0.047 | 0.434 | 0.465 |
| logLik | -13053.633 | -11431.993 | -10777.786 |
| AIC | 26123.267 | 22909.985 | 21621.572 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||
#we evaluate whether the 4 core assumptions of linear regression (Linear, Independence, Normal, Equality) hold
autoplot(best_model, alpha = 0.3, label.size = 3) +
theme_bw()

As shown in graph above, we observe less pattern in the residuals vs fitted plot compared with previous models, indicating that the addition of variables increased the amount of variation in price_4_nights captured by our model. Similar to previous models, we can conclude that our residuals follow approximately a normal distribution, as the normal Q-Q plot suggests. Again, in terms of scale-location, the equal (or constant) variance assumption also seems to hold, as there is no clear positive nor negative trends across the fitted values. Finally, a number of observations with high absolute residuals indicate that some observations do have undue influence on estimates of model parameters, though this require further investigation which is beyond the remits of the present analysis.
#we use the Variance Inflation Factor (VIF) to evaluate if multicollinearity is problematic for our model
kbl(vif(best_model), align="l") %>%
kable_styling(full_width = F)
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 2.04 | 4 | 1.09 |
| room_type | 1.49 | 3 | 1.07 |
| bathrooms | 1.27 | 1 | 1.13 |
| bedrooms | 1.74 | 1 | 1.32 |
| beds | 2.42 | 1 | 1.56 |
| accommodates | 3.01 | 1 | 1.74 |
| neighbourhood_simplified | 1.10 | 4 | 1.01 |
| has_wifi | 1.05 | 1 | 1.03 |
| has_TV | 1.14 | 1 | 1.07 |
| has_internet | 1.07 | 1 | 1.03 |
| has_kitchen | 1.14 | 1 | 1.07 |
| host_is_superhost | 1.05 | 1 | 1.02 |
| review_scores_rating | 2.55 | 1 | 1.60 |
| review_scores_communication | 1.45 | 1 | 1.21 |
| review_scores_location | 1.35 | 1 | 1.16 |
| review_scores_value | 2.04 | 1 | 1.43 |
| review_scores_cleanliness | 1.94 | 1 | 1.39 |
| instant_bookable | 1.09 | 1 | 1.04 |
| security_deposit | 1.13 | 1 | 1.06 |
| cancellation_policy | 1.83 | 3 | 1.11 |
| minimum_nights | 1.10 | 1 | 1.05 |
With all VIFs < 5, it is safe to conclude that our best model does not suffer from collineraity.
rmse_train<-
listings_train %>%
mutate(predictions = exp(predict(best_model, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test <-
listings_test %>%
mutate(predictions = exp(predict(best_model, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
r2_train <-
summary(best_model)$r.squared
r2_test <-
cor(predict(best_model, listings_test), listings_test$price_4_nights)
kbl(matrix(c(rmse_train, rmse_test, r2_train, r2_test),
nrow = 2,
dimnames = list(c("train","test"),
c("RMSE","R-squared"))),
) %>%
kable_styling()
| RMSE | R-squared | |
|---|---|---|
| train | 36521 | 0.465 |
| test | 36745 | 0.246 |
In comparing the RMSE and R2 of our final Best Model when applied to Test and Training data, we find that there is a very small difference in RMSE, and as would be expected, the RMSE is lower for the training than the test dataset. Not only is the difference in RMSE the lowest of all the models we have evaluated throughout this analysis, but it is a difference of only 0.60% in the context of the error figures for each model. This indicates that the model is not as overfitted to the training data as the substantially different R2 values would suggest. Indeed, despite the obvious presence of overfitting, we can safely conclude that our Best Model does perform the best of all models surveyed in terms of model fit (as well as in R2), which is reassuring.
Now, we predict the price to live in an apartment with a private room, which has at least 10 reviews, and an average rating of at least 90, for 2 guests over the course of a 4 night stay in Copenhagen
# we select: an apartment with a private room, which has at least 10 reviews, and an average rating of at least 90, to conduct our prediction
subset_of_listings <- listings_necessary %>%
filter(property_type == 'Apartment',
room_type == 'Private room',
review_scores_rating >= 90,
number_of_reviews >= 10) %>%
# we take the exponential for prediction because we used log(y) in our model
mutate(prediction = exp(predict(best_model, .)))
# we construct a confidence interval
upper_ci <- quantile(subset_of_listings$prediction, 0.975)
lower_ci <- quantile(subset_of_listings$prediction, 0.025)
# we calculate the mean
mean_pred <- mean(subset_of_listings$prediction)
# we visualise
ggplot(subset_of_listings, aes(x = prediction)) +
geom_histogram() +
geom_vline(xintercept = mean_pred, size = 0.5, color = 'red') +
geom_vline(xintercept = upper_ci, size = 0.5, color = 'blue') +
geom_vline(xintercept = lower_ci, size = 0.5, color = 'blue') +
annotate(geom = 'text', label = paste("Point Prediction:", round(mean_pred)), x = mean_pred + 50, y = 35, colour="white", angle=270) +
annotate(geom = 'text', label = paste("Upper 95% CI:", round(upper_ci)), x = upper_ci + 50, y = 35, colour="black", angle=270) +
annotate(geom = 'text', label = paste("Lower 95% CI:", round(lower_ci)), x = lower_ci+ 50, y = 35, colour="black", angle=270) +
theme_fivethirtyeight() +
labs(y = "count", x = "Price: 2 Guests for 4 Nights",
title = 'Point Estimate and Confidence Interval for the Price of a Stay in a Copenhagen AirBnB',
subtitle = 'Histogram of estimated price for a 4-Night Stay for 2 Guests in an AirBnB Private Room in an Apartment, rated at least 90 with at least 10 reviews') +
scale_x_continuous(labels = label_dollar())

As the histogram shows, we conclude that our point estimate in an apartment with private room, overall rating of at least 90 and with 10 or more reviews is $1949 USD, with a 95% confidence interval of $1309 to $2597 USD.